How to normalize position of the elements on the picture [OpenCV] - opencv

I am currently working on program which could help at my work. I'm trying to use Machine Learning for the classification purpose. The problem is that I don't have enough samples for training the model and augmentation is something I'm trying to avoid because hardware problems (not enough RAM) either on my company laptop and on the Google Collab. So I decided to try to somehow normalize the position of the elements so the differences would be visible for the machine even with no big amount of different samples. Unfortunately now I'm struggling how to normalize those pictures.
Element 1a:
Element 1b:
Element 2a:
Element 2b:
Elements 1a and 1b are the same type and 2a - 2b are the same type. Is there a way to somehow normalize position for those pictures (something like position 0) which would help the algorithm to see differences between them? I've tried using cv2.minAreaSquare to get the square position, rotating them and cropping don't needed area but unfortunately those elements can have different width so after scaling them down the contours are deformed unevenly. Then I was trying to get symmetry axis and using this to do a proper cropping after rotation but still the results didn't meet my expectations. I was thinking to add more normalization points like this:
Normalization Points:
And using this points normalize position of the rest of my elements but Perspective Transform takes only 4 points and with 4 points its also not very good methodology. Maybe you guys know a way how to move those elements to have them in the same positions.

Seeing the images, I believe that the transformation between two pictures is either an isometry (translation + rotation) or a similarity (translation + rotation + scaling). These can be determined with just two points. (Perspective takes four points but I think that this is overkill.)
But for good accuracy, you must make sure that the points are found reliably and precisely. In the first place, you need to guess which features of the shapes are repeatable from one sample to the next.
For example, you might estimate that the straight edges are always in the same relative position. In such a case, I would recommend finding two points on some edges, drawing a line between them and find intersections between the lines.
In the illustration, you find edge points along the red profiles, and from them you draw the green lines. They intersect in the yellow points.
For increased accuracy, you can use a least-squares approach to find a best fit on more than two points.

Related

Matching bounding boxes from multiple viewing angles

I have mounted two cameras on different shelves of a fridge (at the bottom and top), facing each other. An object detector is fed with two images from these two streams and returns bounding boxes from them, independently.
The problem: Given bounding boxes from different viewing angles, determine their correspondence.
I know that since depth is unknown, the x, y coordinates in one camera may correspond to multiple positions in the other. This is why our solutions hitherto have been approximations that work with varying success. Naive solution has been to ignore any difference between the camera coordinates and use the Euclidean distance to get correspondence.
Another solution is to use the Fundamental Matrix, which gives a way to calculate correspondence based on epipolar geometry. This solution may be quite tedious because it requires a form of calibration and the results have not been great. It may be due to the fact that my calibration was sloppy.
The last solution, which works poorly, is to use edge och keypoint detection and match them. Since the edges, shape etc. differs quite radically, we understand why this is so.
Ultimately, I was wondering how you would have tackled the problem or if there are anything you can point me towards to get a more robust solutions.
In the left image we have the bottom view. This image has been mirrored about y-axis. The right image is from the top shelf. The object detector has detected two bounding boxes for each viewing angle, a total of four bounding boxes. These are drawn on both cameras. For example, bounding box with id 16 is detected on the top-positioned camera and is also drawn on the bottom camera to indicate its displacement. How would you go about in matching the bounding boxes which belong to the same object viewed from another viewing angle?

How to 3d reconstruct robustly from multiple images with known poses in OpenCV

The traditional solution for high resolution images examples :
extract features (dense) for all images
match features to find tracks through images
triangulate features to 3d points.
I can give two problem here for my case (many 640*480 images with small movements between each others) , first: matching is very slow , especially if the number of images is big, so a better solution can be optical flow tracking.., but it's getting sparse with big moves, ( a mix could solve the problem !!)
second: triangulate tracks , though it is over-determined problem, I find it hard to code a solution, .. (here am asking for simplifying what I read in references )
I searched quite a bit for libraries in that direction, with no useful result.
again, I have ground truth camera matrices and need only 3d positions as first estimate (without BA),
A coded software solution can be of great help as I don't need to reinvent the wheel, though a detailed instructions maybe helpful
this basically shows the underlying geometry for estimating the depth.
As you said, we have camera pose Q, and we are picking a point X from world, X_L is it's projection on left image, now, with Q_L, Q_R and X_L, we are able to make up this green colored epipolar plane, the rest job is easy, we search through points on line (Q_L, X), this line exactly describe the depth of X_L, with different assumptions: X1, X2,..., we can get different projections on the right image
Now we compare the pixel intensity difference from X_L and the reprojected point on right image, just pick the smallest one and that corresponding depth is exactly what we want.
Pretty easy hey? Truth is it's way harder, image is never strictly convex:
This makes our matching extremely hard, since the non-convex function will result any distance function have multiple critical points (candidate matches), how do you decide which one is the correct one?
However, people proposed path based match to handle this problem, methods like: SAD, SSD, NCC, they are introduced to create the distance function as convex as possible, still, they are unable to handle large scale repeated texture problem and low texture problem.
To solve this, people start to search over a long range in the epipolar line, and suddenly found that we can describe this whole distribution of matching metrics into a distance along the depth.
The horizontal axis is depth, and the vertical axis is matching metric score, and this illustration lead us found the depth filter, and we usually describe this distribution with gaussian, aka, gaussian depth filter, and use this filter to discribe the uncertainty of depth, combined with the patch matching method, we can roughly get a proposal.
Now what, let's use some optimization tools, like GN or gradient descent to finally refine the depth estimaiton.
To sum up, the total process of the depth estimation is like the following steps:
assume all depth in all pixel following a initial gaussian distribution
start search through epipolar line and reproject points into target frame
triangulate depth and calculate the uncertainty of the depth from depth filter
run 2 and 3 again to get a new depth distribution and merge with previous one, if they converged then break, ortherwise start again from 2.

OpenCV - align stack of images - different cameras

We have this camera array arranged in an arc around a person (red dot). Think The Matrix - each camera fires at the same time and then we create an animated gif from the output. The problem is that it is near impossible to align the cameras exactly and so I am looking for a way in OpenCV to align the images better and make it smoother.
Looking for general steps. I'm unsure of the order I would do it. If I start with image 1 and match 2 to it, then 2 is further from three than it was at the start. And so matching 3 to 2 would be more change... and the error would propagate. I have seen similar alignments done though. Any help much appreciated.
Here's a thought. How about performing a quick and very simple "calibration" of the imaging system by using a single reference point?
The best thing about this is you can try it out pretty quickly and even if results are too bad for you, they can give you some more insight into the problem. But the bad thing is it may just not be good enough because it's hard to think of anything "less advanced" than this. Here's the description:
Remove the object from the scene
Place a small object (let's call it a "dot") to position that rougly corresponds to center of mass of object you are about to record (the center of area denoted by red circle).
Record a single image with each camera
Use some simple algorithm to find the position of the dot on every image
Compute distances from dot positions to image centers on every image
Shift images by (-x, -y), where (x, y) is the above mentioned distance; after that, the dot should be located in the center of every image.
When recording an actual object, use these precomputed distances to shift all images. After you translate the images, they will be roughly aligned. But since you are shooting an object that is three-dimensional and has considerable size, I am not sure whether the alignment will be very convincing ... I wonder what results you'd get, actually.
If I understand the application correctly, you should be able to obtain the relative pose of each camera in your array using homographies:
https://docs.opencv.org/3.4.0/d9/dab/tutorial_homography.html
From here, the next step would be to correct for alignment issues by estimating the transform between each camera's actual position and their 'ideal' position in the array. These ideal positions could be computed relative to a single camera, or relative to the focus point of the array (which may help simplify calculation). For each image, applying this corrective transform will result in an image that 'looks like' it was taken from the 'ideal' position.
Note that you may need to estimate relative camera pose in 3-4 array 'sections', as it looks like you have a full 180deg array (e.g. estimate homographies for 4-5 cameras at a time). As long as you have some overlap between sections it should work out.
Most of my experience with this sort of thing comes from using MATLAB's stereo camera calibrator app and related functions. Their help page gives a good overview of how to get started estimating camera pose. OpenCV has similar functionality.
https://www.mathworks.com/help/vision/ug/stereo-camera-calibrator-app.html
The cited paper by Zhang gives a great description of the mathematics of pose estimation from correspondence, if you're interested.

Vehicle segmentation and tracking

I've been working on a project for some time, to detect and track (moving) vehicles in video captured from UAV's, currently I am using an SVM trained on bag-of-feature representations of local features extracted from vehicle and background images. I am then using a sliding window detection approach to try and localise vehicles in the images, which I would then like to track. The problem is that this approach is far to slow and my detector isn't as reliable as I would like so I'm getting quite a few false positives.
So I have been considering attempting to segment the cars from the background to find the approximate position so to reduce the search space before applying my classifier, but I am not sure how to go about this, and was hoping someone could help?
Additionally, I have been reading about motion segmentation with layers, using optical flow to segment the frame by flow model, does anyone have any experience with this method, if so could you offer some input to as whether you think this method would be applicable for my problem.
Below is two frames from a sample video
frame 0:
frame 5:
Assumimg your cars are moving, you could try to estimate the ground plane (road).
You may get a descent ground plane estimate by extracting features (SURF rather than SIFT, for speed), matching them over frame pairs, and solving for a homography using RANSAC, since plane in 3d moves according to a homography between two camera frames.
Once you have your ground plane you can identify the cars by looking at clusters of pixels that don't move according to the estimated homography.
A more sophisticated approach would be to do Structure from Motion on the terrain. This only presupposes that it is rigid, and not that it it planar.
Update
I was wondering if you could expand on how you would go about looking for clusters of pixels that don't move according to the estimated homography?
Sure. Say I and K are two video frames and H is the homography mapping features in I to features in K. First you warp I onto K according to H, i.e. you compute the warped image Iw as Iw( [x y]' )=I( inv(H)[x y]' ) (roughly Matlab notation). Then you look at the squared or absolute difference image Diff=(Iw-K)*(Iw-K). Image content that moves according to the homography H should give small differences (assuming constant illumination and exposure between the images). Image content that violates H such as moving cars should stand out.
For clustering high-error pixel groups in Diff I would start with simple thresholding ("every pixel difference in Diff larger than X is relevant", maybe using an adaptive threshold). The thresholded image can be cleaned up with morphological operations (dilation, erosion) and clustered with connected components. This may be too simplistic, but its easy to implement for a first try, and it should be fast. For something more fancy look at Clustering in Wikipedia. A 2D Gaussian Mixture Model may be interesting; when you initialize it with the detection result from the previous frame it should be pretty fast.
I did a little experiment with the two frames you provided, and I have to say I am somewhat surprised myself how well it works. :-) Left image: Difference (color coded) between the two frames you posted. Right image: Difference between the frames after matching them with a homography. The remaining differences clearly are the moving cars, and they are sufficiently strong for simple thresholding.
Thinking of the approach you currently use, it may be intersting combining it with my proposal:
You could try to learn and classify the cars in the difference image D instead of the original image. This would amount to learning what a car motion pattern looks like rather than what a car looks like, which could be more reliable.
You could get rid of the expensive window search and run the classifier only on regions of D with sufficiently high value.
Some additional remarks:
In theory, the cars should even stand out if they are not moving since they are not flat, but given your distance to the scene and camera resolution this effect may be too subtle.
You can replace the feature extraction / matching part of my proposal with Optical Flow, if you like. This amounts to identifying flow vectors that "stick out" from a consistent frame-to-frame motion of the ground. It may be prone to outliers in the optical flow, however. You can also try to get the homography from the flow vectors.
This is important: Regardless of which method you use, once you have found cars in one frame you should use this information to robustify your search of these cars in consecutive frame, giving a higher likelyhood to detections close to the old ones (Kalman filter, etc). That's what tracking is all about!
If the number of cars in your field of view always remain the same but move around then you can use optical flow...it will give you good results against a still background...if the number of cars are changing then you need to call goodFeaturestoTrack function in OpenCV after certain number of frames and again track the cars using optical flow.
You can use background modelling to model the background and hence the cars are always your foreground.The simplest example is frame differentiation...subtract the previous frame current frame. diff(x,y,k) = I(x,y,k) - I(x,y,k-1) .As your cars are moving in each frame you will get their position..
Both the process will work fine since you have a still background I presume..check this link to find what Optical flow can do.

Shape/Pattern Matching Approach in Computer Vision

I am currently facing a, in my opinion, rather common problem which should be quite easy to solve but so far all my approached have failed so I am turning to you for help.
I think the problem is explained best with some illustrations. I have some Patterns like these two:
I also have an Image like (probably better, because the photo this one originated from was quite poorly lit) this:
(Note how the Template was scaled to kinda fit the size of the image)
The ultimate goal is a tool which determines whether the user shows a thumb up/thumbs down gesture and also some angles in between. So I want to match the patterns against the image and see which one resembles the picture the most (or to be more precise, the angle the hand is showing). I know the direction in which the thumb is showing in the pattern, so if i find the pattern which looks identical I also have the angle.
I am working with OpenCV (with Python Bindings) and already tried cvMatchTemplate and MatchShapes but so far its not really working reliably.
I can only guess why MatchTemplate failed but I think that a smaller pattern with a smaller white are fits fully into the white area of a picture thus creating the best matching factor although its obvious that they dont really look the same.
Are there some Methods hidden in OpenCV I havent found yet or is there a known algorithm for those kinds of problem I should reimplement?
Happy New Year.
A few simple techniques could work:
After binarization and segmentation, find Feret's diameter of the blob (a.k.a. the farthest distance between points, or the major axis).
Find the convex hull of the point set, flood fill it, and treat it as a connected region. Subtract the original image with the thumb. The difference will be the area between the thumb and fist, and the position of that area relative to the center of mass should give you an indication of rotation.
Use a watershed algorithm on the distances of each point to the blob edge. This can help identify the connected thin region (the thumb).
Fit the largest circle (or largest inscribed polygon) within the blob. Dilate this circle or polygon until some fraction of its edge overlaps the background. Subtract this dilated figure from the original image; only the thumb will remain.
If the size of the hand is consistent (or relatively consistent), then you could also perform N morphological erode operations until the thumb disappears, then N dilate operations to grow the fist back to its original approximate size. Subtract this fist-only blob from the original blob to get the thumb blob. Then uses the thumb blob direction (Feret's diameter) and/or center of mass relative to the fist blob center of mass to determine direction.
Techniques to find critical points (regions of strong direction change) are trickier. At the simplest, you might also use corner detectors and then check the distance from one corner to another to identify the place when the inner edge of the thumb meets the fist.
For more complex methods, look into papers about shape decomposition by authors such as Kimia, Siddiqi, and Xiaofing Mi.
MatchTemplate seems like a good fit for the problem you describe. In what way is it failing for you? If you are actually masking the thumbs-up/thumbs-down/thumbs-in-between signs as nicely as you show in your sample image then you have already done the most difficult part.
MatchTemplate does not include rotation and scaling in the search space, so you should generate more templates from your reference image at all rotations you'd like to detect, and you should scale your templates to match the general size of the found thumbs up/thumbs down signs.
[edit]
The result array for MatchTemplate contains an integer value that specifies how well the fit of template in image is at that location. If you use CV_TM_SQDIFF then the lowest value in the result array is the location of best fit, if you use CV_TM_CCORR or CV_TM_CCOEFF then it is the highest value. If your scaled and rotated template images all have the same number of white pixels then you can compare the value of best fit you find for all different template images, and the template image that has the best fit overall is the one you want to select.
There are tons of rotation/scaling independent detection functions that could conceivably help you, but normalizing your problem to work with MatchTemplate is by far the easiest.
For the more advanced stuff, check out SIFT, Haar feature based classifiers, or one of the others available in OpenCV
I think you can get excellent results if you just compute the two points that have the furthest shortest path going through white. The direction in which the thumb is pointing is just the direction of the line that joins the two points.
You can do this easily by sampling points on the white area and using Floyd-Warshall.

Resources