OpenCV extrinsic camera from feature points

OpenCV extrinsic camera from feature points - opencv

How do I retrieve the rotation matrix, the translation vector and maybe some scaling factors of each camera using OpenCV when I have pictures of an object from the view of each of these cameras? For every picture I have the image coordinates of several feature points. Not all feature points are visible in all of the pictures.
I want to map the computed 3D coordinates of the feature points of the object to a slightly different object to align the shape of the second object to the first object.
I heard it is possible using cv::calibrateCamera(...) but I can't get quite through it...
Does someone have experiences with that kind of problem?

I was confronted with the same problem as you, in OpenCV. I had a stereo image pair and I wanted to computed the external parameters of the cameras and the world coordinates of all observed points. This problem has been treated here:
Berthold K. P. Horn. Relative orientation revisited. Berthold K. P. Horn. Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 545 Technology ...
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.64.4700
However, I wasn't able to find a suitable implementation of this problem (perhaps you will find one). Due to time limitations I did not have time to understand all the maths in this paper and implement it myself, so I came up with a quick-and-dirty solution that works for me. I will explain what I did to solve it:
Assuming we have two cameras, where the first camera has external parameters RT = Matx::eye(). Now make a guess about the the rotation R of the second camera. For every pair of image points observed in both images, we compute the directions of their corresponding rays in world coordinates and store them in a 2d-array dirs (EDIT: The internal camera parameters are assumed to be known). We can do this since we assume that we know the orientation of every camera. Now we build an overdetermined linear system AC = 0 where C is the centre of the second camera. I provide you with the function to compute A:
Mat buildA(Matx<double, 3, 3> &R, Array<Vec3d, 2> dirs)
{
CV_Assert(dirs.size(0) == 2);
int pointCount = dirs.size(1);
Mat A(pointCount, 3, DataType<double>::type);
Vec3d *a = (Vec3d *)A.data;
for (int i = 0; i < pointCount; i++)
{
a[i] = dirs(0, i).cross(toVec(R*dirs(1, i)));
double length = norm(a[i]);
if (length == 0.0)
{
CV_Assert(false);
}
else
{
a[i] *= (1.0/length);
}
}
return A;
}
Then calling cv::SVD::solveZ(A) will give you the least-squares solution of norm 1 to this system. This way, you obtain the rotation and translation of the second camera. However, since I just made a guess about the rotation of the second camera, I make several guesses about its rotation (parameterized using a 3x1 vector omega from which i compute the rotation matrix using cv::Rodrigues) and then I refine this guess by solving the system AC = 0 repetedly in a Levenberg-Marquardt optimizer with numeric jacobian. It works for me but it is a bit dirty, so you if you have time, I encourage you to implement what is explained in the paper.
EDIT:
Here is the routine in the Levenberg-Marquardt optimizer for evaluating the vector of residues:
void Stereo::eval(Mat &X, Mat &residues, Mat &weights)
{
Matx<double, 3, 3> R2Ref = getRot(X); // Map the 3x1 euler angle to a rotation matrix
Mat A = buildA(R2Ref, _dirs); // Compute the A matrix that measures the distance between ray pairs
Vec3d c;
Mat cMat(c, false);
SVD::solveZ(A, cMat); // Find the optimum camera centre of the second camera at distance 1 from the first camera
residues = A*cMat; // Compute the output vector whose length we are minimizing
weights.setTo(1.0);
}
By the way, I searched a little more on the internet and found some other code that could be useful for computing the relative orientation between cameras. I haven't tried any code yet, but it seems useful:
http://www9.in.tum.de/praktika/ppbv.WS02/doc/html/reference/cpp/toc_tools_stereo.html
http://lear.inrialpes.fr/people/triggs/src/
http://www.maths.lth.se/vision/downloads/

Are these static cameras which you wish to calibrate for future use as a stereo pair? In this case you would want to use the cv::stereoCalibrate() function. OpenCV contains some sample code, one of which is stereo_calib.cpp which may be worth investigating.

Related

Strange issue with stereo triangulation: TWO valid solutions

I am currently using OpenCV for a pose estimation related work, in which I am triangulating points between pairs for reconstruction and scale factor estimation. I have encountered a strange issue while working on this, especially in the opencv functions recoverPose() and triangulatePoints().
Say I have camera 1 and camera 2, spaced apart in X, with cam1 at (0,0,0) and cam2 to the right of it (positive X). I have two arrays points1 and points2 that are the matched features between the two images. According to the OpenCV documentation and code, I have noted two points:
recoverPose() assumes that points1 belong to the camera at (0,0,0).
triangulatePoints() is called twice: one from recoverPose() to tell us which of the four R/t combinations is valid, and then again from my code, and the documentation says:
cv::triangulatePoints(P1, P2, points1, points2, points3D) : points1 -> P1 and points2 -> P2.
Hence, as in the case of recoverPose(), it is safe to assume that P1 is [I|0] and P2 is [R|t].
What I actually found: It doesn't work that way. Although my camera1 is at 0,0,0 and camera2 is at 1,0,0 (1 being up to scale), the only correct configuration is obtained if I run
recoverPose(E, points2, points1...)
triangulatePoints([I|0], [R|t], points2, points1, pts3D)
which should be incorrect, because points2 is the set from R|t, not points1. I tested an image pair of a scene in my room where there are three noticeable objects after triangulation: a monitor and two posters on the wall behind it. Here are the point clouds resulting from the triangulation (excuse the MS Paint)
If I do it the OpenCV's prescribed way: (poster points dispersed in space, weird looking result)
If I do it my (wrong?) way:
Can anybody share their views about what's going on here? Technically, both solutions are valid because all points fall in front of both cameras: and I didn't know what to pick until I rendered it as a pointcloud. Am I doing something wrong, or is it an error in the documentation? I am not that knowledgeable about the theory of computer vision so it might be possible I am missing something fundamental here. Thanks for your time!

I ran into a similar issue. I believe OpenCV defines the translation vector in the opposite way that one would expect. With your camera configuration the translation vector would be [-1, 0, 0]. It's counter intuitive, but both RecoverPose and stereoCalibrate give this translation vector.
I found that when I used the incorrect but intuitive translation vector (eg. [1, 0, 0]) I couldn't get correct results unless I swapped points1 and points2, just as you did.
I suspect the translation vector actually translate the points into the other camera coordinate system, not the vector to translate the camera poses. The OpenCV Documentation, seems to imply that this is the case:
The joint rotation-translation matrix [R|t] is called a matrix of extrinsic parameters. It is used to describe the camera motion around a static scene, or vice versa, rigid motion of an object in front of a still camera. That is, [R|t] translates coordinates of a point (X,Y,Z) to a coordinate system, fixed with respect to the camera.
Wikipedia has a nice description on Translation of Axis:
In the new coordinate system, the point P will appear to have been translated in the opposite direction. For example, if the xy-system is translated a distance h to the right and a distance k upward, then P will appear to have been translated a distance h to the left and a distance k downward in the x'y'-system

Reason for this observation is discussed in this thread
Is the recoverPose() function in OpenCV is left-handed?
Translation t is the vector from cam2 to cam1 in cam2 frame.

OpenCV: solvePnP detection problems

I've got problem with precise detection of markers using OpenCV.
I've recorded video presenting that issue: http://youtu.be/IeSSW4MdyfU
As you see I'm markers that I'm detecting are slightly moved at some camera angles. I've read on the web that this may be camera calibration problems, so I'll tell you guys how I'm calibrating camera, and maybe you'd be able to tell me what am I doing wrong?
At the beginnig I'm collecting data from various images, and storing calibration corners in _imagePoints vector like this
std::vector<cv::Point2f> corners;
_imageSize = cvSize(image->size().width, image->size().height);
bool found = cv::findChessboardCorners(*image, _patternSize, corners);
if (found) {
cv::Mat *gray_image = new cv::Mat(image->size().height, image->size().width, CV_8UC1);
cv::cvtColor(*image, *gray_image, CV_RGB2GRAY);
cv::cornerSubPix(*gray_image, corners, cvSize(11, 11), cvSize(-1, -1), cvTermCriteria(CV_TERMCRIT_EPS+ CV_TERMCRIT_ITER, 30, 0.1));
cv::drawChessboardCorners(*image, _patternSize, corners, found);
}
_imagePoints->push_back(_corners);
Than, after collecting enough data I'm calculating camera matrix and coefficients with this code:
std::vector< std::vector<cv::Point3f> > *objectPoints = new std::vector< std::vector< cv::Point3f> >();
for (unsigned long i = 0; i < _imagePoints->size(); i++) {
std::vector<cv::Point2f> currentImagePoints = _imagePoints->at(i);
std::vector<cv::Point3f> currentObjectPoints;
for (int j = 0; j < currentImagePoints.size(); j++) {
cv::Point3f newPoint = cv::Point3f(j % _patternSize.width, j / _patternSize.width, 0);
currentObjectPoints.push_back(newPoint);
}
objectPoints->push_back(currentObjectPoints);
}
std::vector<cv::Mat> rvecs, tvecs;
static CGSize size = CGSizeMake(_imageSize.width, _imageSize.height);
cv::Mat cameraMatrix = [_userDefaultsManager cameraMatrixwithCurrentResolution:size]; // previously detected matrix
cv::Mat coeffs = _userDefaultsManager.distCoeffs; // previously detected coeffs
cv::calibrateCamera(*objectPoints, *_imagePoints, _imageSize, cameraMatrix, coeffs, rvecs, tvecs);
Results are like you've seen in the video.
What am I doing wrong? is that an issue in the code? How much images should I use to perform calibration (right now I'm trying to obtain 20-30 images before end of calibration).
Should I use images that containg wrongly detected chessboard corners, like this:
or should I use only properly detected chessboards like these:
I've been experimenting with circles grid instead of of chessboards, but results were much worse that now.
In case of questions how I'm detecting marker: I'm using solvepnp function:
solvePnP(modelPoints, imagePoints, [_arEngine currentCameraMatrix], _userDefaultsManager.distCoeffs, rvec, tvec);
with modelPoints specified like this:
markerPoints3D.push_back(cv::Point3d(-kMarkerRealSize / 2.0f, -kMarkerRealSize / 2.0f, 0));
markerPoints3D.push_back(cv::Point3d(kMarkerRealSize / 2.0f, -kMarkerRealSize / 2.0f, 0));
markerPoints3D.push_back(cv::Point3d(kMarkerRealSize / 2.0f, kMarkerRealSize / 2.0f, 0));
markerPoints3D.push_back(cv::Point3d(-kMarkerRealSize / 2.0f, kMarkerRealSize / 2.0f, 0));
and imagePoints are coordinates of marker corners in processing image (I'm using custom algorithm to do that)

In order to properly debug your problem I would need all the code :-)
I assume you are following the approach suggested in the tutorials (calibration and pose) cited by #kobejohn in his comment and so that your code follows these steps:
collect various images of chessboard target
find chessboard corners in images of point 1)
calibrate the camera (with cv::calibrateCamera) and so obtain as a result the intrinsic camera parameters (let's call them intrinsic) and the lens distortion parameters (let's call them distortion)
collect an image of your own custom target (the target is seen at 0:57 in your video) and it is shown in the following figure and find some relevant points in it (let's call the point you found in image image_custom_target_vertices and world_custom_target_vertices the corresponding 3D points).
estimate the rotation matrix (let's call it R) and the translation vector (let's call it t) of the camera from the image of your own custom target you get in point 4), with a call to cv::solvePnP like this one cv::solvePnP(world_custom_target_vertices,image_custom_target_vertices,intrinsic,distortion,R,t)
giving the 8 corners cube in 3D (let's call them world_cube_vertices) you get the 8 2D image points (let's call them image_cube_vertices) by means of a call to cv2::projectPoints like this one cv::projectPoints(world_cube_vertices,R,t,intrinsic,distortion,image_cube_vertices)
draw the cube with your own draw function.
Now, the final result of the draw procedure depends on all the previous computed data and we have to find where the problem lies:
Calibration: as you observed in your answer, in 3) you should discard the images where the corners are not properly detected. You need a threshold for the reprojection error in order to discard "bad" chessboard target images. Quoting from the calibration tutorial:
Re-projection Error
Re-projection error gives a good estimation of just how exact is the
found parameters. This should be as close to zero as possible. Given
the intrinsic, distortion, rotation and translation matrices, we first
transform the object point to image point using cv2.projectPoints().
Then we calculate the absolute norm between what we got with our
transformation and the corner finding algorithm. To find the average
error we calculate the arithmetical mean of the errors calculate for
all the calibration images.
Usually you will find a suitable threshold with some experiments. With this extra step you will get better values for intrinsic and distortion.
Finding you own custom target: it does not seem to me that you explain how you find your own custom target in the step I labeled as point 4). Do you get the expected image_custom_target_vertices? Do you discard images where that results are "bad"?
Pose of the camera: I think that in 5) you use intrinsic found in 3), are you sure nothing is changed in the camera in the meanwhile? Referring to the Callari's Second Rule of Camera Calibration:
Second Rule of Camera Calibration: "Thou shalt not touch the lens
after calibration". In particular, you may not refocus nor change the
f-stop, because both focusing and iris affect the nonlinear lens
distortion and (albeit less so, depending on the lens) the field of
view. Of course, you are completely free to change the exposure time,
as it does not affect the lens geometry at all.
And then there may be some problems in the draw function.

So, I've experimented a lot with my code, and I still haven't fixed the main issue (shifted objects), but I've managed to answer some of calibration questions I've asked.
First of all - in order to obtain good calibration results you have to use images with properly detected grid elements/circles positions!. Using all captured images in calibration process (even those that aren't properly detected) will result bad calibration.
I've experimented with various calibration patterns:
Asymmetric circles pattern (CALIB_CB_ASYMMETRIC_GRID), give much worse results than any other pattern. By worse results I mean that it produces a lot of wrongly detected corners like these:
I've experimented with CALIB_CB_CLUSTERING and it haven't helped much - in some cases (different light environment) it got better, but not much.
Symmetric circles pattern (CALIB_CB_SYMMETRIC_GRID) - better results than asymmetric grid, but still I've got much worse results than standard grid (chessboard). It often produces errors like these:
Chessboard (found using findChessboardCorners function) - this method is producing best possible results - it doesn't produce misaligned corners very often, and almost every calibration is producing similar results to best-possible results from symmetric circles grid
For every calibration I've been using 20-30 images that were coming from different angles. I've tried even with 100+ images but it haven't produced noticeable change in calibration results than smaller amount of images. It's worth noticing that larger number of test images is increasing time needed to compute camera parameters in non-linear way (100 test images in 480x360 resolution are computing 25 minutes in iPad4, compared with 4 minutes with ~50 images)
I've also experimented with solvePNP parameters - but is also haven't gave me any acceptable results: I've tried all 3 detection methods (ITERATIVE, EPNP and P3P), but I haven't seen aby noticeable change.
Also I've tried with useExtrinsicGuess set to true, and I've used rvec and tvec from previous detection, but this one resulted with complete disapperance of detected cube.
I've ran out of ideas - what else could be affecting these shifting problems?

For those still interested:
this is an old question, but I think your problem is not the bad calibration.
I developed an AR app for iOS, using OpenCV and SceneKit, and I have had your same issue.
I think your problem is the wrong render position of the cube:
OpenCV's solvePnP returns the X, Y, Z coordinates of the marker center, but you wanna render the cube over the marker, at a specific distance along the Z axis of the marker, exactly at one half of the cube side size. So you need to improve the Z coordinate of the marker translation vector of this distance.
In fact, when you see your cube from the top, the cube is render properly.
I have done an image in order to explain the problem, but my reputation prevent to post it.

Extract face rotation from homography in a video

I'm trying to determine the orientation of a face in a video.
The video starts with the frontal image of the face, so it has no rotation. In the following frames the head rotates and i'm trying to determine the rotation, which will lead me to determine the face orientation based on the camera position.
I'm using OpenCV and C++ for the job.
I'm using SURF descriptors to find points on the face which i use to calculate an homography between the two images. Being the two frames very close to each other, the head rotation will be minimal in that interval and my homography matrix will be close to the identity matrix.
This is my homography matrix:
H = findHomography(k1,k2,RANSAC,8);
where k1 and k2 are the keypoints extracted with SURF.
I'm using decomposeProjectionMatrix to extract the rotation matrix but now i'm not sure how to interpret the rotMatrix. This one too is basically (1 0 0; 0 1 0; 0 0 1) (where the 0 are numbers in a range from e-10 to e-16).
In theory, what is was trying to do was to find the angle of the rotation at each frame and store it somewhere, so that if i get a 1° change in each frame, after 10 frames i know that my head has changed its orientation by 10°.
I spend some time reading everything i could find about QR decomposition, homography matrices and so on, but i haven't been able to get around this. Hence, any help would be really appreciated.
Thanks!

The upper-left 2x2 of the homography matrix is a 2D rotation matrix. If you work through the multiplication of the matrix with a point (i.e. take R*p), you'll see it's equivalent to:
newX = oldVector dot firstRow
newY = oldVector dot secondRow
In other words, the first row of the matrix is a unit vector which is the x axis of the new head. (If there's a scale difference between the frames it won't be a unit vector, but this method will still work.) So you should be able to calculate
rotation = atan2(second entry of first row, first entry of first row)

Open CV - Several Methods for SfM

I got a task:
We have a system working where a camera does a halfcircle around a human head. We know the camera matrix and the rotation/translation of every frame. (Distortion and more... but I want first to work without these parameters)
My task is that I have only the Camera Matrix, which is constant over this move, and the images (more than 100). Now I have to get the translation and rotation from frame by frame and compare it with the rotation and translation in real world (from the system which I have but only for compare, I have too prove it!)
First steps I did so far:
use the robustMatcher from the OpenCV Cookbook - works finde - 40-70 Matches each frame - visible looks it very good!
I get the fundamentalMatrix with getFundamental(). I use the robust Points from robustMatcher and RANSAC.
When I got the F i can get the Essentialmatrix E with my CameraMatrix K like this:
cv::Mat E = K.t() * F * K; //Found at the Bible HZ Chapter 9.12
Now we need to extract R and t out of E with SVD. By the way camera1 position is just zero because we have to start somewhere.
cv::SVD svd(E);
cv::SVD svd(E);
cv::Matx33d W(0,-1,0, //HZ 9.13
1,0,0,
0,0,1);
cv::Matx33d Wt(0,1,0,//W^
-1,0,0,
0,0,1);
cv::Mat R1 = svd.u * cv::Mat(W) * svd.vt; //HZ 9.19
cv::Mat R2 = svd.u * cv::Mat(Wt) * svd.vt; //HZ 9.19
//R1 or R2???
R = R1; //R2
//t=+u3 or t=-u3?
t = svd.u.col(2); //=u3
This is my actual status!
My plans are:
triangulate all points to get 3D points
Join frame i with frame i++
Visualize my 3D points them somehow!
Now my Questions are:
is this robust matcher dated? is there a other method?
Is it wrong to use this points as descriped at my second step? Must they be converted with distortion or something?
What R and t is this i extract here? Is it the rotation and translation between camera1 and camera2 with point of view from camera1?
When I read at the bible or papers or elsewhere i find that there are 4 possibilities how R and t can be!
´P′ = [UWV^T |+u3] oder [UWV^T |−u3] oder [UW^TV^T |+u3] oder [UW^TV^T |−u3]´
P´ is the projectionmatrix of the second image.
That means t could be - or + and R could be total different?!
I found out that I should calculate one point into 3D and find out if this point is infront of both cameras, then I have found the correct matrix!
I found some of this code at the internet and he just said this no further calculating:
cv::Mat R1 = svd.u * cv::Mat(W) * svd.vt
and
t = svd.u.col(2); //=u3
Why is this correct? If it isn't - how would I do this triangulation in OpenCV?
I compared this translation to the translation which is given to me. (First i had to transfer the translation and rotation in relationship to camera1 but I got this now!) But its not the same. The values of my program are just lets call it jumping from plus too minus. But it should be more constant because the camera is moving in a constant circle.
I am sure that some axes may be switched. I know that the translation is only from -1 till 1 but I thought I could extract a factor from my results to my comparevalues and then it should be similiar.
Does somebody have done something like this before?
Many people doing a camera calibration by using a chessboard, but I can't use this method to get the extrinsic parameters.
I know that visual sfm can do this somehow. (At youtube is a video where someone walks around a tree and get from these pictures a reconstruction of this tree using visual sfm)
This is pretty the same what I have to do.
Last question:
Does somebody know an easy way to visualize my 3D Points? I prefere MeshLab. Some experience with that?

Many people doing a camera calibration by using a chessboard, but I can't use this method to get the extrinsic parameters.
A chess board or checker board is used to find the internal/intrinsic matrix/parameters, not the extrinsic parameters. You're saying you have got the internal matrix already, I suppose that's what you meant by
We know the camera matrix and ...
Those videos you have seen on youtube have done the same, the camera is already calibrated, that is the internal matrix is known.
is this robust matcher dated? is there a other method?
I don't have that book so cant see the code and answer this.
Is it wrong to use this points as descriped at my second step? Must they be converted with distortion or something?
You need to cancel the radial distortion first, see undistortPoints.
What R and t is this i extract here? Is it the rotation and translation between camera1 and camera2 with point of view from camera1?
R is the orientation of the second camera in the first camera's coordinate system. And T is position of the second camera in that coordinate system. These have several usages.
When I read at the bible or papers or elsewhere i find that there are 4 possibilities how ....
Read the relevant section of the bible, this is very well explained there, triangulation is naive method, a better approach is explained there.
Does somebody know an easy way to visualize my 3D Points?
To see them in Meshlab a very easy way is to save the coordinate of the 3D points in a PLY file, this is an extremely simple format and supported by Meshlab and almost all other 3D model viewers.

In this article "An Efficient Solution to the Five-Point Relative Pose Problem", Nistér explain a very good method to determine which of the four configurations it the correct one (talking about R and T).
I've tried the robust matcher and I think is quiet good. The problems that has this matcher is that is really slow because it uses SURF, maybe you should try with others detectors and extractors to improve the speed.I also believe that the function in OpenCV that calculates the fundamental matrix does not need the Ransac parameter because the methods rate and symmetry do a great job removing the outliers, you should try the 8-point parameter.
OpenCV has the function triangulate, this only needs two projection Matrices, points that are in the first and the second image. Check the calib3d module.

How to find Camera matrix for Augmented Reality?

I want to augment a virtual object at x,y,z meters wrt camera. OpenCV has camera calibration functions but I don't understand how exactly I can give coordinates in meters
I tried simulating a Camera in Unity but don't get expected result.
I set the projection matrix as follows and create a unit cube at z = 2.415 + 0.5 .
Where 2.415 is the distance between the eye and the projection plane (Pinhole camera model)
Since the cube's face is at the front clipping plane and it's dimension are unit shouldn't it cover the whole viewport?
Matrix4x4 m = new Matrix4x4();
m[0, 0] = 1;
m[0, 1] = 0;
m[0, 2] = 0;
m[0, 3] = 0;
m[1, 0] = 0;
m[1, 1] = 1;
m[1, 2] = 0;
m[1, 3] = 0;
m[2, 0] = 0;
m[2, 1] = 0;
m[2, 2] = -0.01f;
m[2, 3] = 0;
m[3, 0] = 0;
m[3, 1] = 0;
m[3, 2] = -2.415f;
m[3, 3] = 0;

The global scale of your calibration (i.e. the units of measure of 3D space coordinates) is determined by the geometry of the calibration object you use. For example, when you calibrate in OpenCV using images of a flat checkerboard, the inputs to the calibration procedure are corresponding pairs (P, p) of 3D points P and their images p, the (X, Y, Z) coordinates of the 3D points are expressed in mm, cm, inches, miles, whatever, as required by the size of target you use (and the optics that images it), and the 2D coordinates of the images are in pixels. The output of the calibration routine is the set of parameters (the components of the projection matrix P and the non-linear distortion parameters k) that "convert" 3D coordinates expressed in those metrical units into pixels.
If you don't know (or don't want to use) the actual dimensions of the calibration target, you can just fudge them but leave their ratios unchanged (so that, for example, a square remains a square even though the true length of its side may be unknown). In this case your calibration will be determined up to an unknown global scale. This is actually the common case: in most virtual reality applications you don't really care what the global scale is, as long as the results look correct in the image.
For example, if you want to add an even puffier pair of 3D lips on a video of Angelina Jolie, and composite them with the original video so that the brand new fake lips stay attached and look "natural" on her face, you just need to rescale the 3D model of the fake lips so that it overlaps correctly the image of the lips. Whether the model is 1 yard or one mile away from the CG camera in which you render the composite is completely irrelevant.

For finding augmenting an object you need to find camera pose and orientation. That is the same as finding the camera extrinsics. You also have to calculate first the camera intrinsics (which is called calibraiton).
OpenCV allows you to do all of this, but is not trivial, it requires work on your own. I give you a clue, you first need to recognize something in the scene that you know how it looks, so you can calculate the camera pose by analyzing this object, call it a marker. You can start by the tipical fiducials, they are easy to detect.
Have a look at this thread.

I ended up measuring the field of view manually. Once you know FOV you can easily create the projection matrix. No need to worry about units because in the end the projection is of the form ( X*d/Z, Y*d/Z ). Whatever the units of X,Y,Z may be the ratio of X/Z remains the same.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart