I am developing a feature tracking application and so far, after trying almost all of the feature detectors/descriptors, I've got the most satisfactory overall results with ORB.
Both my feature detector and descriptor are ORB.
I select a specific area of my source image for detecting features (by masking), and then match them with features detected in subsequent frames.
Then I filter my matches by performing a ratio test on the 'matches' obtained from the following code:
std::vector<std::vector<DMatch>> matches1;
m_matcher.knnMatch( m_descriptorsSrcScene, m_descriptorsCurScene, matches1,2 );
I also tried the two-way ratio test (filtering matches from the source to the current scene and vice versa, then keeping the common matches), but it didn't do much, so I went ahead with the one-way ratio test.
I also add a minimum-distance check to my ratio test, which appears to give better results:
if (distanceRatio < m_fThreshRatio && bestMatch.distance < 5*min_dist)
Finally, I estimate the homography:
Mat H = findHomography(points1,points2);
I've tried using the RANSAC method to estimate inliers and then using those to recalculate my homography, but that gives more instability and takes more time.
Then I draw a rectangle around the specific region that is to be tracked. I get the plane coordinates with:
perspectiveTransform( obj_corners, scene_corners, H);
where 'obj_corners' are the coordinates of my masked (or unmasked) region.
The rectangle I draw using 'scene_corners' seems to be vibrating. Increasing the number of features has reduced it quite a bit, but I can't increase them too much because of the time constraint.
How can I improve the stability?
Any suggestions would be appreciated.
If it is the vibrations that are really bothersome to you then you could try taking the moving average of the homography matrices over time:
cv::Mat homoG = cv::findHomography(obj, scene, CV_RANSAC);
if (homography.empty()) {
    homoG.copyTo(homography);  // first frame: initialise the running average
}
cv::accumulateWeighted(homoG, homography, 0.1);
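Make the 'homography' variable global (or otherwise persistent across frames), and keep calling this every time you get a new frame.
The alpha parameter of accumulateWeighted is roughly the reciprocal of the period of the moving average.
So 0.1 roughly corresponds to averaging over the last 10 frames, 0.2 to the last 5, and so on. Strictly speaking it is an exponentially weighted average, so older frames fade out gradually rather than dropping out after a fixed count.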
A suggestion that comes to mind from experience with feature detection/matching is that sometimes you just have to accept the matched feature points will not work perfectly. Even subtle changes in the scene you are looking at can cause somewhat annoying problems, for example changes in light or unwanted objects coming into view.
From what you say, it appears that you have decently working feature matching in place, so you may want to work on a way of keeping the region of interest constant. If you know the typical speed or any other movement patterns unique to the object you are trying to track between frames, or any constraints relating to the position of your camera, that may help you avoid unnecessarily recalculating the region of interest, which is what causes the vibrations. It may even help in creating a more efficient search algorithm, allowing you to increase the number of feature points you can detect and use.
Another (small) hack you can use is to avoid redrawing the region window if the previous window was of similar size and position.
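A minimal sketch of that last hack, assuming the region is described by its four corners and that "similar" means no corner moved by more than a pixel tolerance. The names `Quad`, `maxCornerShift` and `stabiliseQuad`, and the tolerance itself, are illustrative choices rather than anything from OpenCV.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>

struct Point2 { double x, y; };
using Quad = std::array<Point2, 4>;  // the four corners of the tracked region

// Largest distance (in pixels) between corresponding corners of two quads.
double maxCornerShift(const Quad& a, const Quad& b) {
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::hypot(a[i].x - b[i].x, a[i].y - b[i].y));
    }
    return worst;
}

// Keep the previously drawn quad unless the new one moved by more than tolerancePx.
Quad stabiliseQuad(const Quad& previous, const Quad& current, double tolerancePx) {
    return maxCornerShift(previous, current) <= tolerancePx ? previous : current;
}
```

Draw the result of `stabiliseQuad(lastDrawn, sceneCorners, tolerance)` and store it as the new `lastDrawn`. The tolerance trades jitter suppression against a small dead zone in which slow motion is ignored.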
We have this camera array arranged in an arc around a person (red dot). Think The Matrix - each camera fires at the same time and then we create an animated gif from the output. The problem is that it is near impossible to align the cameras exactly and so I am looking for a way in OpenCV to align the images better and make it smoother.
Looking for general steps. I'm unsure of the order I would do it. If I start with image 1 and match 2 to it, then 2 is further from three than it was at the start. And so matching 3 to 2 would be more change... and the error would propagate. I have seen similar alignments done though. Any help much appreciated.
Here's a thought. How about performing a quick and very simple "calibration" of the imaging system by using a single reference point?
The best thing about this is you can try it out pretty quickly and even if results are too bad for you, they can give you some more insight into the problem. But the bad thing is it may just not be good enough because it's hard to think of anything "less advanced" than this. Here's the description:
1. Remove the object from the scene.
2. Place a small object (let's call it a "dot") at a position that roughly corresponds to the center of mass of the object you are about to record (the center of the area denoted by the red circle).
3. Record a single image with each camera.
4. Use some simple algorithm to find the position of the dot in every image.
5. Compute the distances from the dot positions to the image centers in every image.
6. Shift the images by (-x, -y), where (x, y) is the above-mentioned distance; after that, the dot should be located in the center of every image.
When recording an actual object, use these precomputed distances to shift all images. After you translate the images, they will be roughly aligned. But since you are shooting an object that is three-dimensional and has considerable size, I am not sure whether the alignment will be very convincing ... I wonder what results you'd get, actually.
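The offset and shift steps of this calibration could look roughly like the sketch below. It assumes the dot position has already been found by some detector, that images are row-major 8-bit grayscale buffers, and that whole-pixel shifts are good enough. `Offset`, `dotOffset` and `shiftImage` are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Offset { double dx, dy; };

// Offset of the detected dot from the image centre, for one camera.
Offset dotOffset(double dotX, double dotY, int width, int height) {
    return { dotX - (width - 1) / 2.0, dotY - (height - 1) / 2.0 };
}

// Translate a row-major grayscale image by (-dx, -dy), rounded to whole pixels.
// Pixels that would come from outside the source image are set to 0.
std::vector<unsigned char> shiftImage(const std::vector<unsigned char>& src,
                                      int width, int height, Offset off) {
    const long sx = std::lround(off.dx);
    const long sy = std::lround(off.dy);
    std::vector<unsigned char> dst(src.size(), 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Destination (x, y) takes its value from source (x + dx, y + dy).
            const long srcX = x + sx;
            const long srcY = y + sy;
            if (srcX < 0 || srcY < 0 || srcX >= width || srcY >= height) continue;
            dst[static_cast<std::size_t>(y) * width + x] =
                src[static_cast<std::size_t>(srcY) * width + static_cast<std::size_t>(srcX)];
        }
    }
    return dst;
}
```

Compute one `Offset` per camera from its calibration shot, store it, and apply `shiftImage` with that same offset to every later frame from that camera.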
If I understand the application correctly, you should be able to obtain the relative pose of each camera in your array using homographies:
From here, the next step would be to correct for alignment issues by estimating the transform between each camera's actual position and their 'ideal' position in the array. These ideal positions could be computed relative to a single camera, or relative to the focus point of the array (which may help simplify calculation). For each image, applying this corrective transform will result in an image that 'looks like' it was taken from the 'ideal' position.
Note that you may need to estimate relative camera pose in 3-4 array 'sections', as it looks like you have a full 180deg array (e.g. estimate homographies for 4-5 cameras at a time). As long as you have some overlap between sections it should work out.
Most of my experience with this sort of thing comes from using MATLAB's stereo camera calibrator app and related functions. Their help page gives a good overview of how to get started estimating camera pose. OpenCV has similar functionality.
The cited paper by Zhang gives a great description of the mathematics of pose estimation from correspondence, if you're interested.
I am trying to calibrate a camera using a checkerboard by the well known Zhang's method followed by bundle adjustment, which is available in both Matlab and OpenCV. There are a lot of empirical guidelines but from my personal experience the accuracy is pretty random. It could sometimes be really good but also sometimes really bad. The result actually can vary quite a bit just by simply placing the checkerboard at different locations. Suppose the target camera is rectilinear with 110 degree horizontal FOV.
Does the number of squares in the checkerboard affect the accuracy? Zhang uses 8x8 in his original paper without really explaining why.
Does the length of the square affect the accuracy? Zhang uses 17cm x 17cm without really explaining why.
What is the optimal number of snap shots of different checkerboard position/orientation? Zhang uses 5 images only. I saw people suggesting 20~30 images with checkerboards at various angles, fills the entire field of view, tilted to the left, right, top and bottom, and suggested there should be no checkerboard placed at similar position/orientation otherwise the result will be biased towards that position/orientation. Is this correct?
The goal is to figure out a workflow to get consistent calibration result.
If the accuracy you get is "pretty random" then you are likely not doing it right: with stable optics and a well conducted procedure you should consistently be getting RMS projection errors within a few tenths of a pixel. Whether this corresponds to variances of millimeters or meters in 3D space depends, of course, on your optics and sensor resolution (calibration is not a way around physics).
I wrote a few suggestions some time ago in this answer, and I recommend you follow them. In particular, pay attention to locking the focus distance (I have seen and heard of countless people trying to calibrate a camera on autofocus, and being sorely disappointed). As for the size of the target, again it depends on your optics and camera resolution, but generally speaking the goals are (1) to fill with measurements both the field of view and the volume of space you'll be working with, and (2) to observe significant perspective foreshortening, because that is what constrains the solution for the FOV. Good luck!
[Edited to address a comment]
Concerning variations in the parameter values across successive calibrations, the first thing I'd do is calculate the cross RMS errors, i.e. the RMS error on dataset 1 with the camera calibrated on dataset 2, and vice versa. If either is significantly higher than the calibration errors, it's an indication that the camera has changed between the two calibrations and so all bets are off. Do you have auto-{focus,iris,zoom,stabilization} on? Turn them all off: auto-anything is the bane of calibration, with the only exception of exposure time. Otherwise, you need to see if the variations you observe in the parameters are actually meaningful (hint, they often are not). A variation of the focal length in pixels of several parts per thousand is probably irrelevant with today's sensor resolutions - you can verify that by expressing it in mm, and comparing it to the pixel (dot) pitch of the sensor. Also, variations of the position of the principal point on the order of tens of pixels are common, since it is poorly observed unless your calibration procedure is very carefully designed to estimate it.
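As a rough illustration of the two checks above, the sketch below scores a set of detected corners against their reprojections and converts a focal-length change from pixels to millimetres. It assumes the reprojected points were already produced elsewhere (for the cross error, by projecting dataset 1's board points with the calibration from dataset 2). `Pixel`, `rmsReprojectionError` and `focalChangeMm` are illustrative names.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Pixel { double u, v; };

// RMS distance in pixels between detected corners and their reprojections.
// Returns NaN if the inputs are empty or differ in length.
double rmsReprojectionError(const std::vector<Pixel>& detected,
                            const std::vector<Pixel>& reprojected) {
    if (detected.empty() || detected.size() != reprojected.size()) {
        return std::numeric_limits<double>::quiet_NaN();
    }
    double sumSq = 0.0;
    for (std::size_t i = 0; i < detected.size(); ++i) {
        const double du = detected[i].u - reprojected[i].u;
        const double dv = detected[i].v - reprojected[i].v;
        sumSq += du * du + dv * dv;
    }
    return std::sqrt(sumSq / static_cast<double>(detected.size()));
}

// Convert a focal-length change from pixels to millimetres,
// given the sensor's pixel pitch in micrometres.
double focalChangeMm(double deltaPixels, double pixelPitchUm) {
    return deltaPixels * pixelPitchUm / 1000.0;
}
```

For example, a 3-pixel change in focal length on a sensor with a 2 µm pitch is 0.006 mm, which gives a feel for how small such a variation is in physical terms.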
Ideally you want to place your checkerboard at roughly the same distance from the camera as the distance at which you want to do your measurements. So your checkerboard squares must be large enough to be resolvable from that distance. You also want to cover the entire field of view with points, especially close to the edges and corners of the frame. Also, the smaller the board is, the more images you should take to cover the entire field of view. So 20-30 images is usually a good rule of thumb.
Another thing is that the checkerboard should be asymmetric. Ideally, you want to have an even number of squares along one side, and an odd number of squares along the other. This way the board's in-plane orientation is unambiguous.
Also, I would suggest that you try the Camera Calibrator app in MATLAB. At the very least, look at the documentation, which has a lot of useful suggestions for calibrating cameras.
I did a lot of experiments using the accelerometer to detect the movement size (magnitude) as just one value from the x, y, z accelerations. I am using an iPhone 4 with an accelerometer update interval of 1.0 / 50.0 (50 Hz), but I've also tried 100 Hz, 150 Hz and 200 Hz.
Acceleration on X axis
Acceleration on Y axis
Acceleration on Z axis
I assume (I hope I am correct) that the accelerations are the small peaks on the graph, not the big steps. From my experiments I think that the big steps show the device position. If the position changes, the step changes too.
If my previous assumption is correct, I need to cut the peaks from the graph and summarize them. Here comes my question: how can I cut those peaks without losing the information, i.e. the peak sizes?
I know that a high-pass filter does this kind of thing (it passes the high peaks and blocks the noise, the small ones); I've read some papers about filters. But for me the filter cut a lot of information from my "signal" (accelerometer data).
I think that there should be a better way of getting the information out of the data.
I've tried a simple one which looks nice but isn't correct.
I produced this data using my function magnitude:
for i = 2 : length(x)
    converted(i-1) = x(i-1) - x(i);
end
Where x is my data and converted array is the result.
The following line generated the image below, which looks nice.
xyz = magnitude(datay) + magnitude(dataz) + magnitude(datax)
However, the problem with that solution is that if I have continuous acceleration the graph will just show the first point and then go down. I know that I need a better filter somehow, but I am a bit confused. Could you give some advice on how I can do this properly?
Thanks for your time,
I really appreciate your help
Edit(answers for Zaph question):
What are you trying to accomplish?
I want to measure the movement when the iPhone is placed on a desk, chair or bed. The accelerometer is so sensitive that if I put a pencil down on the desk, it registers. I want to measure all movement that happens in a specific time.
What are the scale units?
I'm not scaling the data.
When you say "device position" what do you mean, an accelerometer provides movement (in iPhones with gyros)
I am using only the accelerometer. When I put the device like the picture below I got values around -1 on x coordinate, 0.0 on z and y coordinate. This is what I mean as device position.
The measurements that are returned from the accelerometer are acceleration, not position.
I'm not sure what you mean with "big steps" but the peaks show a change of acceleration. The fact that the values are not 0 when holding the device still is from the fact that the gravitation accelerates the device with 9.81 m/s^2 (the magnitude of the acceleration vector).
You are potentially trying to do something quite difficult, especially with the low-quality sensors that are embedded in phones: that is, getting the actual coordinate acceleration of the phone.
What you can do is detect the time periods when the phone was moved or touched. You can first calculate the magnitude (norm) of the acceleration signal and then, with a moving window, compare the sample standard deviation against some threshold: windows where it stays below the threshold are stationary, and the remaining windows are where the phone moved. Determining how the phone moved is a more complicated issue. Of course, you can check the orientation in the stationary areas between movements.
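A minimal sketch of that moving-window check, assuming the samples are already collected in a vector and that the window length and threshold are tuned by hand. `Accel`, `magnitudes` and `movementWindows` are illustrative names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Accel { double x, y, z; };

// Norm of each acceleration sample (about 1 g when the device is at rest).
std::vector<double> magnitudes(const std::vector<Accel>& samples) {
    std::vector<double> mag;
    mag.reserve(samples.size());
    for (const Accel& s : samples) {
        mag.push_back(std::sqrt(s.x * s.x + s.y * s.y + s.z * s.z));
    }
    return mag;
}

// For every window start i, computes the sample standard deviation of
// mag[i .. i + window) and flags the window as moving if it exceeds threshold.
std::vector<bool> movementWindows(const std::vector<double>& mag,
                                  std::size_t window, double threshold) {
    std::vector<bool> moving;
    if (window < 2 || mag.size() < window) return moving;
    for (std::size_t i = 0; i + window <= mag.size(); ++i) {
        double mean = 0.0;
        for (std::size_t j = i; j < i + window; ++j) mean += mag[j];
        mean /= static_cast<double>(window);
        double sumSq = 0.0;
        for (std::size_t j = i; j < i + window; ++j) {
            sumSq += (mag[j] - mean) * (mag[j] - mean);
        }
        const double stddev = std::sqrt(sumSq / static_cast<double>(window - 1));
        moving.push_back(stddev > threshold);
    }
    return moving;
}
```

Because the norm is used, the constant gravity component shows up as a steady offset regardless of how the phone is lying, and only changes in the signal push the standard deviation over the threshold.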
I've been working on a project for some time to detect and track (moving) vehicles in video captured from UAVs. Currently I am using an SVM trained on bag-of-features representations of local features extracted from vehicle and background images. I am then using a sliding-window detection approach to try to localise vehicles in the images, which I would then like to track. The problem is that this approach is far too slow and my detector isn't as reliable as I would like, so I'm getting quite a few false positives.
So I have been considering attempting to segment the cars from the background to find the approximate position so to reduce the search space before applying my classifier, but I am not sure how to go about this, and was hoping someone could help?
Additionally, I have been reading about motion segmentation with layers, using optical flow to segment the frame by flow model, does anyone have any experience with this method, if so could you offer some input to as whether you think this method would be applicable for my problem.
Below are two frames from a sample video:
frame 0:
frame 5:
Assuming your cars are moving, you could try to estimate the ground plane (road).
You may get a decent ground plane estimate by extracting features (SURF rather than SIFT, for speed), matching them over frame pairs, and solving for a homography using RANSAC, since a plane in 3D moves according to a homography between two camera frames.
Once you have your ground plane you can identify the cars by looking at clusters of pixels that don't move according to the estimated homography.
A more sophisticated approach would be to do Structure from Motion on the terrain. This only presupposes that it is rigid, not that it is planar.
I was wondering if you could expand on how you would go about looking for clusters of pixels that don't move according to the estimated homography?
Sure. Say I and K are two video frames and H is the homography mapping features in I to features in K. First you warp I onto K according to H, i.e. you compute the warped image Iw as Iw( [x y]' )=I( inv(H)[x y]' ) (roughly Matlab notation). Then you look at the squared or absolute difference image Diff=(Iw-K)*(Iw-K). Image content that moves according to the homography H should give small differences (assuming constant illumination and exposure between the images). Image content that violates H such as moving cars should stand out.
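The warp-and-difference step can be sketched without any library as below. It assumes grayscale frames stored row-major, nearest-neighbour sampling instead of proper interpolation, and an invertible H. `Image`, `inverse` and `warpedSquaredDifference` are illustrative names.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using Mat3 = std::array<double, 9>;  // row-major 3x3 matrix

// Minimal grayscale image with row-major pixel values.
struct Image {
    int width;
    int height;
    std::vector<double> px;
    double at(int x, int y) const { return px[static_cast<std::size_t>(y) * width + x]; }
};

// Inverse of a 3x3 matrix via the adjugate; assumes the matrix is invertible.
Mat3 inverse(const Mat3& m) {
    const double a = m[0], b = m[1], c = m[2];
    const double d = m[3], e = m[4], f = m[5];
    const double g = m[6], h = m[7], i = m[8];
    const double det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g);
    return { (e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det,
             (f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det,
             (d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det };
}

// Diff(x, y) = (Iw(x, y) - K(x, y))^2 with Iw(x, y) = I(inv(H) [x y 1]').
// Pixels whose source location falls outside frame I are left at 0.
Image warpedSquaredDifference(const Image& frameI, const Image& frameK, const Mat3& H) {
    const Mat3 Hinv = inverse(H);
    Image diff{frameK.width, frameK.height, std::vector<double>(frameK.px.size(), 0.0)};
    for (int y = 0; y < frameK.height; ++y) {
        for (int x = 0; x < frameK.width; ++x) {
            const double w = Hinv[6] * x + Hinv[7] * y + Hinv[8];
            if (w == 0.0) continue;  // point maps to infinity
            const long sx = std::lround((Hinv[0] * x + Hinv[1] * y + Hinv[2]) / w);
            const long sy = std::lround((Hinv[3] * x + Hinv[4] * y + Hinv[5]) / w);
            if (sx < 0 || sy < 0 || sx >= frameI.width || sy >= frameI.height) continue;
            const double d = frameI.at(static_cast<int>(sx), static_cast<int>(sy)) - frameK.at(x, y);
            diff.px[static_cast<std::size_t>(y) * frameK.width + x] = d * d;
        }
    }
    return diff;
}
```

Looping over the destination grid and sampling the source through the inverse homography avoids the holes that forward mapping would leave in the warped image.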
For clustering high-error pixel groups in Diff I would start with simple thresholding ("every pixel difference in Diff larger than X is relevant", maybe using an adaptive threshold). The thresholded image can be cleaned up with morphological operations (dilation, erosion) and clustered with connected components. This may be too simplistic, but it's easy to implement for a first try, and it should be fast. For something fancier, look at Clustering on Wikipedia. A 2D Gaussian Mixture Model may be interesting; when you initialize it with the detection result from the previous frame it should be pretty fast.
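A bare-bones version of the thresholding and connected-components idea is sketched below. It assumes a fixed global threshold and 4-connectivity, and it uses a minimum-area filter as a crude stand-in for the morphological clean-up. `Blob` and `findBlobs` are illustrative names.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Bounding box and pixel count of one connected group of high-difference pixels.
struct Blob { int minX, minY, maxX, maxY, area; };

// Thresholds a row-major difference image and groups the pixels above the
// threshold into 4-connected components, dropping those smaller than minArea.
std::vector<Blob> findBlobs(const std::vector<double>& diff, int width, int height,
                            double threshold, int minArea) {
    std::vector<char> visited(diff.size(), 0);
    std::vector<Blob> blobs;
    const int dx[4] = {1, -1, 0, 0};
    const int dy[4] = {0, 0, 1, -1};
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const std::size_t idx = static_cast<std::size_t>(y) * width + x;
            if (visited[idx] || diff[idx] <= threshold) continue;
            Blob b{x, y, x, y, 0};
            std::queue<std::pair<int, int>> pending;  // breadth-first flood fill
            pending.push(std::make_pair(x, y));
            visited[idx] = 1;
            while (!pending.empty()) {
                const std::pair<int, int> p = pending.front();
                pending.pop();
                ++b.area;
                b.minX = std::min(b.minX, p.first);
                b.maxX = std::max(b.maxX, p.first);
                b.minY = std::min(b.minY, p.second);
                b.maxY = std::max(b.maxY, p.second);
                for (int k = 0; k < 4; ++k) {
                    const int nx = p.first + dx[k];
                    const int ny = p.second + dy[k];
                    if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
                    const std::size_t nidx = static_cast<std::size_t>(ny) * width + nx;
                    if (visited[nidx] || diff[nidx] <= threshold) continue;
                    visited[nidx] = 1;
                    pending.push(std::make_pair(nx, ny));
                }
            }
            if (b.area >= minArea) blobs.push_back(b);
        }
    }
    return blobs;
}
```

Each returned bounding box is a candidate region, so an expensive classifier only has to look at those boxes instead of scanning the whole frame.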
I did a little experiment with the two frames you provided, and I have to say I am somewhat surprised myself how well it works. :-) Left image: Difference (color coded) between the two frames you posted. Right image: Difference between the frames after matching them with a homography. The remaining differences clearly are the moving cars, and they are sufficiently strong for simple thresholding.
Thinking of the approach you currently use, it may be interesting to combine it with my proposal:
You could try to learn and classify the cars in the difference image Diff instead of the original image. This would amount to learning what a car motion pattern looks like rather than what a car looks like, which could be more reliable.
You could get rid of the expensive window search and run the classifier only on regions of Diff with sufficiently high values.
Some additional remarks:
In theory, the cars should even stand out if they are not moving since they are not flat, but given your distance to the scene and camera resolution this effect may be too subtle.
You can replace the feature extraction / matching part of my proposal with Optical Flow, if you like. This amounts to identifying flow vectors that "stick out" from a consistent frame-to-frame motion of the ground. It may be prone to outliers in the optical flow, however. You can also try to get the homography from the flow vectors.
This is important: regardless of which method you use, once you have found cars in one frame you should use this information to make your search for these cars in consecutive frames more robust, giving a higher likelihood to detections close to the old ones (Kalman filter, etc.). That's what tracking is all about!
If the number of cars in your field of view always remains the same but they move around, then you can use optical flow. It will give you good results against a still background. If the number of cars changes, then you need to call the goodFeaturesToTrack function in OpenCV after a certain number of frames and track the cars again using optical flow.
You can use background modelling to model the background, so that the cars are always your foreground. The simplest example is frame differencing: subtract the previous frame from the current frame, diff(x,y,k) = I(x,y,k) - I(x,y,k-1). As your cars move in each frame, you will get their position.
Both approaches will work fine, since I presume you have a still background. Check this link to see what optical flow can do.
I need to detect corners on grayscale images with the highest possible accuracy. Currently I am using the OpenCV function: cvFindCornerSubPix().
I prepared a simple test: got an image with a corner of black/white edges:
and then a series of images of the same object, moved by 1/16 pixel each time. I checked the pixel values manually; the test images are fine.
Detection results were disappointing:
Even though in TermCrit the condition is set to 100 iterations or 0.005 threshold, detection error gets as big as 0.08 pixel.
The graph shows the error as a function of position within a pixel. It does not look random at all. Another thing worth noting: for other angular positions of the corner (when the edges are not horizontal/vertical) the results are better, but still not perfect.
Any idea, how to make this function work properly, why it does not, or what to use instead?
I would greatly appreciate any advice
Less than 10% of a pixel really isn't bad performance at all. For reference, a correlation peak detector suitable for the production of 3D model from satellite images will have the same order of magnitude of error.
As pointed out in the comments, the exact error pattern will depend on the interpolation method that you use to generate the subpixel pattern. In order to avoid the non-monotonicity introduced by higher-order interpolation methods (beyond order 2), I would suggest the following protocol:
Generate your input image in high resolution, 16 times bigger;
Move your target by 1 pixel at a time in this HR image;
Produce your test images by downsampling (careful: apply an appropriate blurring function like a PSF first if you go for brutal downsampling in order to avoid aliasing) to the correct size.
Finally, it is often not desirable to go to a smaller error magnitude. The subpixel corner detector was designed to be used in images where many (typically between 20 and 100) points are detected. These points are then used in a robust estimation process that should remove outliers and average the error on the valid point sets.