I have a webcam mounted ~12 inches off a table facing down. I have a sheet of paper under it that can move in any direction, but only in a 2D plan on the table. I want to use the webcam to figure out in which direction the sheet of paper is moving. Is there an algorithm to do this? What is it called?
I suggest optical flow. From Wikipedia:
Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (an eye or a camera) and the scene
Or, to quote a presentation from the Stanford Artificial Intelligence Lab:
Given a set of points in an image, find those same points in another image.
It means you can compute the displacement of a set of points belonging to the object you want to track from one image to another -> resulting in a set of vectors describing the direction of your object.
Find good image features to track using cvGoodFeaturesToTrack() -- you should get good results as long as your sheet of paper is distinct from the table
Find its corners using cvFindCornerSubPix() and
Compute the optical flow using cvCalcOpticalFlowPyrLK() -- "LK" means "Lucas-Kanade", the name of the algorithm
See OpenCV's Motion Analysis and Object Tracking documentation for details.
Related
I am asked to create calibration for the eye-tracking algorithm. However, I still don't really understand about how does the calibration helps in making our gaze estimation more accurate, as well as how calibration in eye-tracking actually works. I have read https://www.tobiidynavox.com/support-training/eye-tracker-calibration/, as well as https://developer.tobii.com/community/forums/topic/explain-calibration/, but I still don't fully understand it. I will appreciate if somebody can explain it to me.
Thank you
In the answer below, I assume that you are referring to standard pupil-centre corneal-reflection video-oculography rather than any other form of eye tracking technology.
In eye tracking, calibration is the process that transforms the coordinates of features located in a two dimensional still video frame of the eye into gaze coordinates (i.e. coordinates that are related to the world being observed). For example, let's say your eye tracker produces a 400 × 400 pixel image of the eye, and the subject is looking at a screen that is 1024 × 768 pixels in size, some distance in front of them. The calibration process needs to relate coordinates in the eye image to where the person is looking (i.e. gazing) at on the display screen. This process is not trivial: just because the pupil is centred in the eye image does not mean that the person is looking at the centre of the display in the world, for example. And the position of the pupil centre could move within the eye image even though the direction of gaze is held constant in the world. This is why we track the centre of the pupil and the corneal reflection, as the vector linking the two is robust to translation of the eye within the image that occurs in the absence of a gaze rotation.
A standard way to do this mapping is via relatively straightforward 2D non-linear regression: you move a target at known coordinates on the display and ask the participant to fixate steadily on each, while recording the location of the pupil centre and corneal reflection in the eye image. The calibration process will map the vector linking the pupil centre and the corneal reflection to the corresponding known gaze coordinates. This produces a regression solution that allows you to map intermediate locations to their interpolated gaze coordinates.
(An alternative, or supplementary, approach is model-based rather than regression-based, but let's not go there right now.)
So in essence, calibration doesn't improve gaze estimation, it provides gaze estimation. Without first doing a calibration, all you are doing is tracking the movements of features (the pupil and corneal reflection) within a relatively arbitrary image of the eye. Until calibration is carried out, you have no idea at that stage where that eye is actually pointing in the world.
Having said all that, this is not at all a coding-based question (or answer), so not actually sure that StackOverflow is the ideal venue to be asking this.
I'm currently working on an augmented reality application using a medical imaging program called 3DSlicer. My application runs as a module within the Slicer environment and is meant to provide the tools necessary to use an external tracking system to augment a camera feed displayed within Slicer.
Currently, everything is configured properly so that all that I have left to do is automate the calculation of the camera's extrinsic matrix, which I decided to do using OpenCV's solvePnP() function. Unfortunately this has been giving me some difficulty as I am not acquiring the correct results.
My tracking system is configured as follows:
The optical tracker is mounted in such a way that the entire scene can be viewed.
Tracked markers are rigidly attached to a pointer tool, the camera, and a model that we have acquired a virtual representation for.
The pointer tool's tip was registered using a pivot calibration. This means that any values recorded using the pointer indicate the position of the pointer's tip.
Both the model and the pointer have 3D virtual representations that augment a live video feed as seen below.
The pointer and camera (Referred to as C from hereon) markers each return a homogeneous transform that describes their position relative to the marker attached to the model (Referred to as M from hereon). The model's marker, being the origin, does not return any transformation.
I obtained two sets of points, one 2D and one 3D. The 2D points are the coordinates of a chessboard's corners in pixel coordinates while the 3D points are the corresponding world coordinates of those same corners relative to M. These were recorded using openCV's detectChessboardCorners() function for the 2 dimensional points and the pointer for the 3 dimensional. I then transformed the 3D points from M space to C space by multiplying them by C inverse. This was done as the solvePnP() function requires that 3D points be described relative to the world coordinate system of the camera, which in this case is C, not M.
Once all of this was done, I passed in the point sets into solvePnp(). The transformation I got was completely incorrect, though. I am honestly at a loss for what I did wrong. Adding to my confusion is the fact that OpenCV uses a different coordinate format from OpenGL, which is what 3DSlicer is based on. If anyone can provide some assistance in this matter I would be exceptionally grateful.
Also if anything is unclear, please don't hesitate to ask. This is a pretty big project so it was hard for me to distill everything to just the issue at hand. I'm wholly expecting that things might get a little confusing for anyone reading this.
Thank you!
UPDATE #1: It turns out I'm a giant idiot. I recorded colinear points only because I was too impatient to record the entire checkerboard. Of course this meant that there were nearly infinite solutions to the least squares regression as I only locked the solution to 2 dimensions! My values are much closer to my ground truth now, and in fact the rotational columns seem correct except that they're all completely out of order. I'm not sure what could cause that, but it seems that my rotation matrix was mirrored across the center column. In addition to that, my translation components are negative when they should be positive, although their magnitudes seem to be correct. So now I've basically got all the right values in all the wrong order.
Mirror/rotational ambiguity.
You basically need to reorient your coordinate frames by imposing the constraints that (1) the scene is in front of the camera and (2) the checkerboard axes are oriented as you expect them to be. This boils down to multiplying your calibrated transform for an appropriate ("hand-built") rotation and/or mirroring.
The basic problems is that the calibration target you are using - even when all the corners are seen, has at least a 180^ deg rotational ambiguity unless color information is used. If some corners are missed things can get even weirder.
You can often use prior info about the camera orientation w.r.t. the scene to resolve this kind of ambiguities, as I was suggesting above. However, in more dynamical situation, of if a further degree of automation is needed in situations in which the target may be only partially visible, you'd be much better off using a target in which each small chunk of corners can be individually identified. My favorite is Matsunaga and Kanatani's "2D barcode" one, which uses sequences of square lengths with unique crossratios. See the paper here.
I've been working on a project for some time, to detect and track (moving) vehicles in video captured from UAV's, currently I am using an SVM trained on bag-of-feature representations of local features extracted from vehicle and background images. I am then using a sliding window detection approach to try and localise vehicles in the images, which I would then like to track. The problem is that this approach is far to slow and my detector isn't as reliable as I would like so I'm getting quite a few false positives.
So I have been considering attempting to segment the cars from the background to find the approximate position so to reduce the search space before applying my classifier, but I am not sure how to go about this, and was hoping someone could help?
Additionally, I have been reading about motion segmentation with layers, using optical flow to segment the frame by flow model, does anyone have any experience with this method, if so could you offer some input to as whether you think this method would be applicable for my problem.
Below is two frames from a sample video
frame 0:
frame 5:
Assumimg your cars are moving, you could try to estimate the ground plane (road).
You may get a descent ground plane estimate by extracting features (SURF rather than SIFT, for speed), matching them over frame pairs, and solving for a homography using RANSAC, since plane in 3d moves according to a homography between two camera frames.
Once you have your ground plane you can identify the cars by looking at clusters of pixels that don't move according to the estimated homography.
A more sophisticated approach would be to do Structure from Motion on the terrain. This only presupposes that it is rigid, and not that it it planar.
Update
I was wondering if you could expand on how you would go about looking for clusters of pixels that don't move according to the estimated homography?
Sure. Say I and K are two video frames and H is the homography mapping features in I to features in K. First you warp I onto K according to H, i.e. you compute the warped image Iw as Iw( [x y]' )=I( inv(H)[x y]' ) (roughly Matlab notation). Then you look at the squared or absolute difference image Diff=(Iw-K)*(Iw-K). Image content that moves according to the homography H should give small differences (assuming constant illumination and exposure between the images). Image content that violates H such as moving cars should stand out.
For clustering high-error pixel groups in Diff I would start with simple thresholding ("every pixel difference in Diff larger than X is relevant", maybe using an adaptive threshold). The thresholded image can be cleaned up with morphological operations (dilation, erosion) and clustered with connected components. This may be too simplistic, but its easy to implement for a first try, and it should be fast. For something more fancy look at Clustering in Wikipedia. A 2D Gaussian Mixture Model may be interesting; when you initialize it with the detection result from the previous frame it should be pretty fast.
I did a little experiment with the two frames you provided, and I have to say I am somewhat surprised myself how well it works. :-) Left image: Difference (color coded) between the two frames you posted. Right image: Difference between the frames after matching them with a homography. The remaining differences clearly are the moving cars, and they are sufficiently strong for simple thresholding.
Thinking of the approach you currently use, it may be intersting combining it with my proposal:
You could try to learn and classify the cars in the difference image D instead of the original image. This would amount to learning what a car motion pattern looks like rather than what a car looks like, which could be more reliable.
You could get rid of the expensive window search and run the classifier only on regions of D with sufficiently high value.
Some additional remarks:
In theory, the cars should even stand out if they are not moving since they are not flat, but given your distance to the scene and camera resolution this effect may be too subtle.
You can replace the feature extraction / matching part of my proposal with Optical Flow, if you like. This amounts to identifying flow vectors that "stick out" from a consistent frame-to-frame motion of the ground. It may be prone to outliers in the optical flow, however. You can also try to get the homography from the flow vectors.
This is important: Regardless of which method you use, once you have found cars in one frame you should use this information to robustify your search of these cars in consecutive frame, giving a higher likelyhood to detections close to the old ones (Kalman filter, etc). That's what tracking is all about!
If the number of cars in your field of view always remain the same but move around then you can use optical flow...it will give you good results against a still background...if the number of cars are changing then you need to call goodFeaturestoTrack function in OpenCV after certain number of frames and again track the cars using optical flow.
You can use background modelling to model the background and hence the cars are always your foreground.The simplest example is frame differentiation...subtract the previous frame current frame. diff(x,y,k) = I(x,y,k) - I(x,y,k-1) .As your cars are moving in each frame you will get their position..
Both the process will work fine since you have a still background I presume..check this link to find what Optical flow can do.
I have a set of 3-d points and some images with the projections of these points. I also have the focal length of the camera and the principal point of the images with the projections (resulting from previously done camera calibration).
Is there any way to, given these parameters, find the automatic correspondence between the 3-d points and the image projections? I've looked through some OpenCV documentation but I didn't find anything suitable until now. I'm looking for a method that does the automatic labelling of the projections and thus the correspondence between them and the 3-d points.
The question is not very clear, but I think you mean to say that you have the intrinsic calibration of the camera, but not its location and attitude with respect to the scene (the "extrinsic" part of the calibration).
This problem does not have a unique solution for a general 3d point cloud if all you have is one image: just notice that the image does not change if you move the 3d points anywhere along the rays projecting them into the camera.
If have one or more images, you know everything about the 3D cloud of points (e.g. the points belong to an object of known shape and size, and are at known locations upon it), and you have matched them to their images, then it is a standard "camera resectioning" problem: you just solve for the camera extrinsic parameters that make the 3D points project onto their images.
If you have multiple images and you know that the scene is static while the camera is moving, and you can match "enough" 3d points to their images in each camera position, you can solve for the camera poses up to scale. You may want to start from David Nister's and/or Henrik Stewenius's papers on solvers for calibrated cameras, and then look into "bundle adjustment".
If you really want to learn about this (vast) subject, Zisserman and Hartley's book is as good as any. For code, look into libmv, vxl, and the ceres bundle adjuster.
If I take a picture with a camera, so I know the distance from the camera to the object, such as a scale model of a house, I would like to turn this into a 3D model that I can maneuver around so I can comment on different parts of the house.
If I sit down and think about taking more than one picture, labeling direction, and distance, I should be able to figure out how to do this, but, I thought I would ask if someone has some paper that may help explain more.
What language you explain in doesn't matter, as I am looking for the best approach.
Right now I am considering showing the house, then the user can put in some assistance for height, such as distance from the camera to the top of that part of the model, and given enough of this it would be possible to start calculating heights for the rest, especially if there is a top-down image, then pictures from angles on the four sides, to calculate relative heights.
Then I expect that parts will also need to differ in color to help separate out the various parts of the model.
As mentioned, the problem is very hard and is often also referred to as multi-view object reconstruction. It is usually approached by solving the stereo-view reconstruction problem for each pair of consecutive images.
Performing stereo reconstruction requires that pairs of images are taken that have a good amount of visible overlap of physical points. You need to find corresponding points such that you can then use triangulation to find the 3D co-ordinates of the points.
Epipolar geometry
Stereo reconstruction is usually done by first calibrating your camera setup so you can rectify your images using the theory of epipolar geometry. This simplifies finding corresponding points as well as the final triangulation calculations.
If you have:
the intrinsic camera parameters (requiring camera calibration),
the camera's position and rotation (it's extrinsic parameters), and
8 or more physical points with matching known positions in two photos (when using the eight-point algorithm)
you can calculate the fundamental and essential matrices using only matrix theory and use these to rectify your images. This requires some theory about co-ordinate projections with homogeneous co-ordinates and also knowledge of the pinhole camera model and camera matrix.
If you want a method that doesn't need the camera parameters and works for unknown camera set-ups you should probably look into methods for uncalibrated stereo reconstruction.
Correspondence problem
Finding corresponding points is the tricky part that requires you to look for points of the same brightness or colour, or to use texture patterns or some other features to identify the same points in pairs of images. Techniques for this either work locally by looking for a best match in a small region around each point, or globally by considering the image as a whole.
If you already have the fundamental matrix, it will allow you to rectify the images such that corresponding points in two images will be constrained to a line (in theory). This helps you to use faster local techniques.
There is currently still no ideal technique to solve the correspondence problem, but possible approaches could fall in these categories:
Manual selection: have a person hand-select matching points.
Custom markers: place markers or use specific patterns/colours that you can easily identify.
Sum of squared differences: take a region around a point and find the closest whole matching region in the other image.
Graph cuts: a global optimisation technique based on optimisation using graph theory.
For specific implementations you can use Google Scholar to search through the current literature. Here is one highly cited paper comparing various techniques:
A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms.
Multi-view reconstruction
Once you have the corresponding points, you can then use epipolar geometry theory for the triangulation calculations to find the 3D co-ordinates of the points.
This whole stereo reconstruction would then be repeated for each pair of consecutive images (implying that you need an order to the images or at least knowledge of which images have many overlapping points). For each pair you would calculate a different fundamental matrix.
Of course, due to noise or inaccuracies at each of these steps you might want to consider how to solve the problem in a more global manner. For instance, if you have a series of images that are taken around an object and form a loop, this provides extra constraints that can be used to improve the accuracy of earlier steps using something like bundle adjustment.
As you can see, both stereo and multi-view reconstruction are far from solved problems and are still actively researched. The less you want to do in an automated manner the more well-defined the problem becomes, but even in these cases quite a bit of theory is required to get started.
Alternatives
If it's within the constraints of what you want to do, I would recommend considering dedicated hardware sensors (such as the XBox's Kinect) instead of only using normal cameras. These sensors use structured light, time-of-flight or some other range imaging technique to generate a depth image which they can also combine with colour data from their own cameras. They practically solve the single-view reconstruction problem for you and often include libraries and tools for stitching/combining multiple views.
Epipolar geometry references
My knowledge is actually quite thin on most of the theory, so the best I can do is to further provide you with some references that are hopefully useful (in order of relevance):
I found a PDF chapter on Multiple View Geometry that contains most of the critical theory. In fact the textbook Multiple View Geometry in Computer Vision should also be quite useful (sample chapters available here).
Here's a page describing a project on uncalibrated stereo reconstruction that seems to include some source code that could be useful. They find matching points in an automated manner using one of many feature detection techniques. If you want this part of the process to be automated as well, then SIFT feature detection is commonly considered to be an excellent non-real-time technique (since it's quite slow).
A paper about Scene Reconstruction from Multiple Uncalibrated Views.
A slideshow on Methods for 3D Reconstruction from Multiple Images (it has some more references below it's slides towards the end).
A paper comparing different multi-view stereo reconstruction algorithms can be found here. It limits itself to algorithms that "reconstruct dense object models from calibrated views".
Here's a paper that goes into lots of detail for the case that you have stereo cameras that take multiple images: Towards robust metric reconstruction
via a dynamic uncalibrated stereo head. They then find methods to self-calibrate the cameras.
I'm not sure how helpful all of this is, but hopefully it includes enough useful terminology and references to find further resources.
Research has made significant progress and these days it is possible to obtain pretty good-looking 3D shapes from 2D images. For instance, in our recent research work titled "Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes With Deep Generative Networks" took a big step in solving the problem of obtaining 3D shapes from 2D images. In our work, we show that you can not only go from 2D to 3D directly and get a good, approximate 3D reconstruction but you can also learn a distribution of 3D shapes in an efficient manner and generate/synthesize 3D shapes. Below is an image of our work showing that we are able to do 3D reconstruction even from a single silhouette or depth map (on the left). The ground-truth 3D shapes are shown on the right.
The approach we took has some contributions related to cognitive science or the way the brain works: the model we built shares parameters for all shape categories instead of being specific to only one category. Also, it obtains consistent representations and takes the uncertainty of the input view into account when producing a 3D shape as output. Therefore, it is able to naturally give meaningful results even for very ambiguous inputs. If you look at the citation to our paper you can see even more progress just in terms of going from 2D images to 3D shapes.
This problem is known as Photogrammetry.
Google will supply you with endless references, just be aware that if you want to roll your own, it's a very hard problem.
Check out The Deadalus Project, althought that website does not contain a gallery with illustrative information about the solution, it post several papers and info about the working method.
I watched a lecture from one of the main researchers of the project (Roger Hubbold), and the image results are quite amazing! Althought is a complex and long problem. It has a lot of tricky details to take into account to get an approximation of the 3d data, take for example the 3d information from wall surfaces, for which the heuristic to work is as follows: Take a photo with normal illumination of the scene, and then retake the picture in same position with full flash active, then substract both images and divide the result by a pre-taken flash calibration image, apply a box filter to this new result and then post-process to estimate depth values, the whole process is explained in detail in this paper (which is also posted/referenced in the project website)
Google Sketchup (free) has a photo matching tool that allows you to take a photograph and match its perspective for easy modeling.
EDIT: It appears that you're interested in developing your own solution. I thought you were trying to obtain a 3D model of an image in a single instance. If this answer isn't helpful, I apologize.
Hope this helps if you are trying to construct 3d volume from 2d stack of images !! You can use open source tool such as ImageJ Fiji which comes with 3d viewer plugin..
https://quppler.com/creating-a-classifier-using-image-j-fiji-for-3d-volume-data-preparation-from-stack-of-images/