In my last stage of my thesis, i have to compute 3D bounding boxes through two ip cameras using yolo. I have already found my exterior orientation with cv2.solvePnp() using some ground truth values ( measured values of points on the wall of the room ) and done calibration with rational model enabled because my cameras are wide lens. All i am asking is, through the whole process, do i have to use cv2.getOptimalNewCameraMatrix() on cv2.undistort() in order to correctly continue through my 3d bounding box or not? Because my results with non using it, with parameter alpha = 0 and alpha = 1 seems quite correct to me.
Related
I need to find the intrinsic parameters of a CCTV camera using a set of historic footage images (That is all I got, no control on the environment, thus no chessboard calibration).
The good news is that I have the access to some ground-truth real-world coordinates, visible in most of the images.
Just wondering if there is any solid approach to come up with the camera intrinsic parameters.
P.S. I already found the homography matrix using cv2.findHomography in Python.
P.S. I have already tested QTcalib on two machines, but it is unable to visualize the images in the first place. Not sure what is wrong with it.
Thanks in advance.
intrinsic parameters contain both fx fy cx cy and skew with additional distortion parameters k1-k5 r1-r2.
Assuming you have no distortion and cx and cy are perfectly in the center. Image origin at top left as a normal understanding of the image. As you say you know some ground truth level 3D points.3D measurements are with respect to camera optical axis. Then this 3D point P can be projected into camera image plane called p. The P p O(the camera optical center) with center lines forms isosceles triangle.
fx / (p_x-cx) = P_z / P_x
fx = (p_x-cx) * P_z / P_x
The same goes for the fy. and usually fx and fy are the same.
This is under the perfect assumption that you don't have distortion on camera. If you start to have distortion, then you need to find enough sample points all over the image to form distortion understanding as shown below. One or 2 points won't give you the whole picture understanding.
There are some cheats in some papers that using sea vanishing lines(see ref, it is a series of works) or perfect 3D building vanishing points to detect the distortion. We start from extrinsic to intrinsic and it can get some good guess after some trial eventually. But it is very much in research and can not apply to general cases.
Ref: Han Wang, Wei Mou, Xiaozheng Mou, Shenghai Yuan, Soner Ulun, Shuai Yang and Bok-Suk Shin, An Automatic Self-Calibration Approach for Wide Baseline Stereo Cameras Using Sea Surface Images, unmanned system
If all you have is a video and a few 3d points, your best bet is probably to matchmove it, that is, do a manually assisted bundle adjustment using a 3D computer graphics environment, e.g. Blender. There are a lot of tutorials online on how to do it (example). To add the 3d points as constraints, you build some shapes representing them in the virtual world (e.g. some small spheres) and place them so that their relative positions match the ground truth you have, then add them to the tracker solution.
I have an array of data from a grayscale image that I have segmented sets of contiguous points of a certain intensity value from.
Currently I am doing a naive bounding box routine where I find the minimum and maximum (x,y) [row, col] points. This obviously does not provide the smallest possible box that contains the set of points which is demonstrable by simply rotating a rectangle so the longest axis is no longer aligned with a principal axis.
What I wish to do is find the minimum sized oriented bounding box. This seems to be possible using an algorithm known as rotating calipers, however the implementations of this algorithm seem to rely on the idea that you have a set of vertices to begin with. Some details on this algorithm: https://www.geometrictools.com/Documentation/MinimumAreaRectangle.pdf
My main issue is in finding the vertices within the data that I currently have. I believe I need to at least find candidate vertices in order to reduce the amount of iterations I am performing, since the amount of points is relatively large and treating the interior points as if they are vertices is unnecessary if I can figure out a way to not include them.
Here is some example data that I am working with:
Here's the segmented scene using the naive algorithm, where it segments out the central objects relatively well due to the objects mostly being aligned with the image axes:
.
In red, you can see the current bounding boxes that I am drawing utilizing 2 vertices: top-left and bottom-right corners of the groups of points I have found.
The rotation part is where my current approach fails, as I am only defining the bounding box using two points, anything that is rotated and not axis-aligned will occupy much more area than necessary to encapsulate the points.
Here's an example with rotated objects in the scene:
Here's the current naive segmentation's performance on that scene, which is drawing larger than necessary boxes around the rotated objects:
Ideally the result would be bounding boxes aligned with the longest axis of the points that are being segmented, which is what I am having trouble implementing.
Here's an image roughly showing what I am really looking to accomplish:
You can also notice unnecessary segmentation done in the image around the borders as well as some small segments, which should be removed with some further heuristics that I have yet to develop. I would also be open to alternative segmentation algorithm suggestions that provide a more robust detection of the objects I am interested in.
I am not sure if this question will be completely clear, therefore I will try my best to clarify if it is not obvious what I am asking.
It's late, but that might still help. This is what you need to do:
expand pixels to make small segments connect larger bodies
find connected bodies
select a sample of pixels from each body
find the MBR ([oriented] minimum bounding rectangle) for selected set
For first step you can perform dilation. It's somehow like DBSCAN clustering. For step 3 you can simply select random pixels from a uniform distribution. Obviously the more pixels you keep, the more accurate the MBR will be. I tested this in MATLAB:
% import image as a matrix of 0s and 1s
oI = ~im2bw(rgb2gray(imread('vSb2r.png'))); % original image
% expand pixels
dI = imdilate(oI,strel('disk',4)); % dilated
% find connected bodies of pixels
CC = bwconncomp(dI);
L = labelmatrix(CC) .* uint8(oI); % labeled
% mark some random pixels
rI = rand(size(oI))<0.3;
sI = L.* uint8(rI) .* uint8(oI); % sampled
% find MBR for a set of connected pixels
for i=1:CC.NumObjects
[Y,X] = find(sI == i);
mbr(i) = getMBR( X, Y );
end
You can also remove some ineffective pixels using some more processing and morphological operations:
remove holes
find boundaries
find skeleton
In MATLAB:
I = imfill(I, 'holes');
I = bwmorph(I,'remove');
I = bwmorph(I,'skel');
Assuming the static scene, with a single camera moving exactly sideways at small distance, there are two frames and a following computed optic flow (I use opencv's calcOpticalFlowFarneback):
Here scatter points are detected features, which are painted in pseudocolor with depth values (red is little depth, close to the camera, blue is more distant). Now, I obtain those depth values by simply inverting optic flow magnitude, like d = 1 / flow. Seems kinda intuitive, in a motion-parallax-way - the brighter the object, the closer it is to the observer. So there's a cube, exposing a frontal edge and a bit of a side edge to the camera.
But then I'm trying to project those feature points from camera plane to the real-life coordinates to make a kind of top view map (where X = (x * d) / f and Y = d (where d is depth, x is pixel coordinate, f is focal length, and X and Y are real-life coordinates). And here's what I get:
Well, doesn't look cubic to me. Looks like the picture is skewed to the right. I've spent some time thinking about why, and it seems that 1 / flow is not an accurate depth metric. Playing with different values, say, if I use 1 / power(flow, 1 / 3), I get a better picture:
But, of course, power of 1 / 3 is just a magic number out of my head. The question is, what is the relationship between optic flow in depth in general, and how do I suppose to estimate it for a given scene? We're just considering camera translation here. I've stumbled upon some papers, but no luck trying to find a general equation yet. Some, like that one, propose a variation of 1 / flow, which isn't going to work, I guess.
Update
What bothers me a little is that simple geometry points me to 1 / flow answer too. Like, optic flow is the same (in my case) as disparity, right? Then using this formula I get d = Bf / (x2 - x1), where B is distance between two camera positions, f is focal length, x2-x1 is precisely the optic flow. Focal length is a constant, and B is constant for any two given frames, so that leaves me with 1 / flow again multiplied by a constant. Do I misunderstand something about what optic flow is?
for a static scene, moving a camera precisely sideways a known amount, is exactly the same as a stereo camera setup. From this, you can indeed estimate depth, if your system is calibrated.
Note that calibration in this sense is rather broad. In order to get real accurate depth, you will need to in the end supply a scale parameter on top of the regular calibration stuff you have in openCV, or else there is a single uniform ambiguity of the 3D (This last step is often called going to the "metric" reconstruction from only the "Euclidean").
Another thing which is apart of broad calibration is lens distortion compensation. Before anything else, you probably want to force your cameras to behave like pin-hole cameras (which real-world cameras usually dont).
With that said, optical flow is definetely very different from a metric depth map. If you properly calibraty and rectify your system first, then optical flow is still not equivalent to disparity estimation. If your system is rectified, there is no point in doing a full optical flow estimation (such as Farnebäck), because the problem is thereafter constrained along the horizontal lines of the image. Doing a full optical flow estimation (giving 2 d.o.f) will introduce more error after said rectification likely.
A great reference for all this stuff is the classic "Multiple View Geometry in Computer Vision"
I am working on a project to detect the 3D location of the object. I have two cameras set up at two corners of the room and I have obtained the Fundamental matrix between them. These cameras are internally calibrated. My images are 2592 X 1944
K = [1228 0 3267
0 1221 538
0 0 1 ]
F = [-1.098e-7 3.50715e-7 -0.000313
2.312e-7 2.72256e-7 4.629e-5
0.000234 -0.00129250 1 ]
Now, How do I proceed so that given a 3D point in space, I should be able to get points on the image which correspond to the same object in the room. If I can obtain the right projection matrices (with correct scale) I can use them later as inputs to OpenCV's traingulatePoints function to obtain the location of the object.
I have been stuck at this since a long time. So, please help me.
Thanks.
From what I gather, you have obtained the Fundamental matrix through some means of calibration? Either way, with the fundamental matrix (or the calibration rig itself) you can obtain the pose difference via decomposition of the Essential matrix. Once you have that, you can use matched feature points (using a feature extractor and descriptor like SURF, BRISK, ...) to identify which points in one image belong to the same object point as another feature point in the other image.
With that information, you should be able to triangulate away.
Sorry its not coming in size of comment..
so #user2167617 reply to your comment.
Pretty much. A few pointers, though: the singular values should be (s,s,0), so (1.3, 1.05, 0) is a pretty good guess. About the R: Technically, this is right, however, ignoring signs. It might very well be that you get a rotation matrix which does not satisfy the constraint deteminant(R) = 1 but is instead -1. You might want to multiply it with -1 in that case. Generally, if you run into problems with this approach, try to determine the Essential Matrix using the 5 point algorithm (implemented into the very newest version of OpenCV, you will have to build it yourself). The scale is indeed impossible to obtain with these informations. However, it's all to scale. If you define for example the distance between the cameras being 1 unit, then everything will be measured in that unit.
May be it will be simplier use cv::reprojectImageTo3D function? It will give you 3D coordinates.
Is there a way to calculate the distance to specific object using stereo camera?
Is there an equation or something to get distance using disparity or angle?
NOTE: Everything described here can be found in the Learning OpenCV book in the chapters on camera calibration and stereo vision. You should read these chapters to get a better understanding of the steps below.
One approach that do not require you to measure all the camera intrinsics and extrinsics yourself is to use openCVs calibration functions. Camera intrinsics (lens distortion/skew etc) can be calculated with cv::calibrateCamera, while the extrinsics (relation between left and right camera) can be calculated with cv::stereoCalibrate. These functions take a number of points in pixel coordinates and tries to map them to real world object coordinates. CV has a neat way to get such points, print out a black-and-white chessboard and use the cv::findChessboardCorners/cv::cornerSubPix functions to extract them. Around 10-15 image pairs of chessboards should do.
The matrices calculated by the calibration functions can be saved to disc so you don't have to repeat this process every time you start your application. You get some neat matrices here that allow you to create a rectification map (cv::stereoRectify/cv::initUndistortRectifyMap) that can later be applied to your images using cv::remap. You also get a neat matrix called Q, which is a disparity-to-depth matrix.
The reason to rectify your images is that once the process is complete for a pair of images (assuming your calibration is correct), every pixel/object in one image can be found on the same row in the other image.
There are a few ways you can go from here, depending on what kind of features you are looking for in the image. One way is to use CVs stereo correspondence functions, such as Stereo Block Matching or Semi Global Block Matching. This will give you a disparity map for the entire image which can be transformed to 3D points using the Q matrix (cv::reprojectImageTo3D).
The downfall of this is that unless there is much texture information in the image, CV isn't really very good at building a dense disparity map (you will get gaps in it where it couldn't find the correct disparity for a given pixel), so another approach is to find the points you want to match yourself. Say you find the feature/object in x=40,y=110 in the left image and x=22 in the right image (since the images are rectified, they should have the same y-value). The disparity is calculated as d = 40 - 22 = 18.
Construct a cv::Point3f(x,y,d), in our case (40,110,18). Find other interesting points the same way, then send all of the points to cv::perspectiveTransform (with the Q matrix as the transformation matrix, essentially this function is cv::reprojectImageTo3D but for sparse disparity maps) and the output will be points in an XYZ-coordinate system with the left camera at the center.
I am still working on it, so I will not post entire source code yet. But I will give you a conceptual solution.
You will need the following data as input (for both cameras):
camera position
camera point of interest (point at which camera is looking)
camera resolution (horizontal and vertical)
camera field of view angles (horizontal and vertical)
You can measure the last one yourself, by placing the camera on a piece of paper and drawing two lines and measuring an angle between these lines.
Cameras do not have to be aligned in any way, you only need to be able to see your object in both cameras.
Now calculate a vector from each camera to your object. You have (X,Y) pixel coordinates of the object from each camera, and you need to calculate a vector (X,Y,Z). Note that in the simple case, where the object is seen right in the middle of the camera, the solution would simply be (camera.PointOfInterest - camera.Position).
Once you have both vectors pointing at your target, lines defined by these vectors should cross in one point in ideal world. In real world they would not because of small measurement errors and limited resolution of cameras. So use the link below to calculate the distance vector between two lines.
Distance between two lines
In that link: P0 is your first cam position, Q0 is your second cam position and u and v are vectors starting at camera position and pointing at your target.
You are not interested in the actual distance, they want to calculate. You need the vector Wc - we can assume that the object is in the middle of Wc. Once you have the position of your object in 3D space you also get whatever distance you like.
I will post the entire source code soon.
I have the source code for detecting human face and returns not only depth but also real world coordinates with left camera (or right camera, I couldn't remember) being origin. It is adapted from source code from "Learning OpenCV" and refer to some websites to get it working. The result is generally quite accurate.