The Goal ::
I intend to have an uploaded image as a static background, and render 3d objects on a designated plane in that image.
I need to get the orientation of the camera, in relation to the plane. So then I can properly render the 3D models on said plane.
The user will specify the length & width of the plane. As well as outline the plane, resulting in the plane's 4 corners ( A,B,C,D on the 2D image ).
What I've tried ::
I've looked at using webassembly ported OpenCV, particularly solvePnP, but while testing I was getting the error Uncaught TypeError: Cannot read property '$$' of undefined at RegisteredPointer.nonConstNoSmartPtrRawPointerToWireType
Code I was using below:
// 3D world coords
var vv = cv.matFromArray( 4,3,cv.CV_32SC1,[
0,0,0,
0,4,0,
8,4,0,
8,0,0,
])
// 2D img coords
var imageP = cv.matFromArray( 4,2,cv.CV_8S,[
292,272,
72,379,
487,530,
701,470,
])
// camera internal params
var cm = new cv.Mat(3,3,cv.CV_32FC1,new cv.Scalar())
var rvec
var tvec
cv.solvePnP( vv, imageP, cm, new cv.Mat(), rvec, tvec, false, cv.SOLVEPNP_P3P )
With the known variables, is it possible to glean any information about the camera's orientation / position / FOV?
The answer is abit more complex.
First, you have to calibrate camera to get cameraMatrix. You should also remove distortion along the way,
var cm = new cv.Mat(3,3,cv.CV_32FC1,new cv.Scalar())
In ur code, you just declare it and did not put any content to it. and the math requires it as shown below
then You need to know the physical size of the object in order to do that.
The easiest way for you to start is you use chessboard calibration pattern sample which you can pre-enter its size.
You can follow this sample to find how to find the camera orientation and relative position to pre-determined objects.
https://docs.opencv.org/2.4/doc/tutorials/calib3d/camera_calibration/camera_calibration.html
The source code to achieve what you want is here
https://github.com/opencv/opencv/blob/master/modules/calib3d/src/calibration.cpp
Start from line 1400.
The sample result with AR can be found in this link
https://www.youtube.com/watch?v=2hek-DmiGEw
Related
I'm coding a calibration algorithm for my depth-camera. This camera outputs an one channel 2D image with the distance of every object in the image.
From that image, and using the camera and distortion matrices, I was able to create a 3D point cloud, from the camera perspective. Now I wish to convert those 3D coordinates to a global/world coordinates. But, since I can't use any patterns like the chessboard to calibrate the camera, I need another alternative.
So I was thinking: If I provide some ground points (in the camera perspective), I would define a plane that I know should have the Z coordinate close to zero, in the global perspective. So, how should I proceed to find the transformation matrix that horizontalizes the plane.
Local coordinates ground plane, with an object on top
I tried using the OpenCV's solvePnP, but it didn't gave me the correct transformation. Also I thought in using the OpenCV's estimateAffine3D, but I don't know where should the global coordinates be mapped to, since the provided ground points do not need to lay on any specific pattern/shape.
Thanks in advance
What you need is what's commonly called extrinsic calibration: a rigid transformation relating the 3D camera reference frame to the 'world' reference frame. Usually, this is done by finding known 3D points in the world reference frame and their corresponding 2D projections in the image. This is what SolvePNP does.
To find the best rotation/translation between two sets of 3D points, in the sense of minimizing the root mean square error, the solution is:
Theory: https://igl.ethz.ch/projects/ARAP/svd_rot.pdf
Easier explanation: http://nghiaho.com/?page_id=671
Python code (from the easier explanation site): http://nghiaho.com/uploads/code/rigid_transform_3D.py_
So, if you want to transform 3D points from the camera reference frame, do the following:
As you proposed, define some 3D points with known position in the world reference frame, for example (but not necessarily) with Z=0. Put the coordinates in a Nx3 matrix P.
Get the corresponding 3D points in the camera reference frame. Put them in a Nx3 matrix Q.
From the file defined in point 3 above, call rigid_transform_3D(P, Q). This will return a 3x3 matrix R and a 3x1 vector t.
Then, for any 3D point in the world reference frame p, as a 3x1 vector, you can obtain the corresponding camera point, q with:
q = R.dot(p)+t
EDIT: answer when 3D position of points in world are unspecified
Indeed, for this procedure to work, you need to know (or better, to specify) the 3D coordinates of the points in your world reference frame. As stated in your comment, you only know the points are in a plane but don't have their coordinates in that plane.
Here is a possible solution:
Take the selected 3D points in camera reference frame, let's call them q'i.
Fit a plane to these points, for example as described in https://www.ilikebigbits.com/2015_03_04_plane_from_points.html. The result of this will be a normal vector n. To fully specify the plane, you need also to choose a point, for example the centroid (average) of q'i.
As the points surely don't perfectly lie in the plane, project them onto the plane, for example as described in: How to project a point onto a plane in 3D?. Let's call these projected points qi.
At this point you have a set of 3D points, qi, that lie on a perfect plane, which should correspond closely to the ground plane (z=0 in world coordinate frame). The coordinates are in the camera reference frame, though.
Now we need to specify an origin and the direction of the x and y axes in this ground plane. You don't seem to have any criteria for this, so an option is to arbitrarily set the origin just "below" the camera center, and align the X axis with the camera optical axis. For this:
Project the point (0,0,0) into the plane, as you did in step 4. Call this o. Project the point (0,0,1) into the plane and call it a. Compute the vector a-o, normalize it and call it i.
o is the origin of the world reference frame, and i is the X axis of the world reference frame, in camera coordinates. Call j=nxi ( cross product). j is the Y-axis and we are almost finished.
Now, obtain the X-Y coordinates of the points qi in the world reference frame, by projecting them on i and j. That is, do the dot product between each qi and i to get the X values and the dot product between each qi and j to get the Y values. The Z values are all 0. Call these X, Y, 0 coordinates pi.
Use these values of pi and qi to estimate R and t, as in the first part of the answer!
Maybe there is a simpler solution. Also, I haven't tested this, but I think it should work. Hope this helps.
I am using opencv to calibrate my webcam. So, what I have done is fixed my webcam to a rig, so that it stays static and I have used a chessboard calibration pattern and moved it in front of the camera and used the detected points to compute the calibration. So, this is as we can find in many opencv examples (https://docs.opencv.org/3.1.0/dc/dbb/tutorial_py_calibration.html)
Now, this gives me the camera intrinsic matrix and a rotation and translation component for mapping each of these chessboard views from the chessboard space to world space.
However, what I am interested in is the global extrinsic matrix i.e. once I have removed the checkerboard, I want to be able to specify a point in the image scene i.e. x, y and its height and it gives me the position in the world space. As far as I understand, I need both the intrinsic and extrinsic matrix for this. How should one proceed to compute the extrinsic matrix from here? Can I use the measurements that I have already gathered from the chessboard calibration step to compute the extrinsic matrix as well?
Let me place some context. Consider the following picture, (from https://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html):
The camera has "attached" a rigid reference frame (Xc,Yc,Zc). The intrinsic calibration that you successfully performed allows you to convert a point (Xc,Yc,Zc) into its projection on the image (u,v), and a point (u,v) in the image to a ray in (Xc,Yc,Zc) (you can only get it up to a scaling factor).
In practice, you want to place the camera in an external "world" reference frame, let's call it (X,Y,Z). Then there is a rigid transformation, represented by a rotation matrix, R, and a translation vector T, such that:
|Xc| |X|
|Yc|= R |Y| + T
|Zc| |Z|
That's the extrinsic calibration (which can be written also as a 4x4 matrix, that's what you call the extrinsic matrix).
Now, the answer. To obtain R and T, you can do the following:
Fix your world reference frame, for example the ground can be the (x,y) plane, and choose an origin for it.
Set some points with known coordinates in this reference frame, for example, points in a square grid in the floor.
Take a picture and get the corresponding 2D image coordinates.
Use solvePnP to obtain the rotation and translation, with the following parameters:
objectPoints: the 3D points in the world reference frame.
imagePoints: the corresponding 2D points in the image in the same order as objectPoints.
cameraMatris: the intrinsic matrix you already have.
distCoeffs: the distortion coefficients you already have.
rvec, tvec: these will be the outputs.
useExtrinsicGuess: false
flags: you can use CV_ITERATIVE
Finally, get R from rvec with the Rodrigues function.
You will need at least 3 non-collinear points with corresponding 3D-2D coordinates for solvePnP to work (link), but more is better. To have good quality points, you could print a big chessboard pattern, put it flat in the floor, and use it as a grid. What's important is that the pattern is not too small in the image (the larger, the more stable your calibration will be).
And, very important: for the intrinsic calibration, you used a chess pattern with squares of a certain size, but you told the algorithm (which does kind of solvePnPs for each pattern), that the size of each square is 1. This is not explicit, but is done in line 10 of the sample code, where the grid is built with coordinates 0,1,2,...:
objp[:,:2] = np.mgrid[0:7,0:6].T.reshape(-1,2)
And the scale of the world for the extrinsic calibration must match this, so you have several possibilities:
Use the same scale, for example by using the same grid or by measuring the coordinates of your "world" plane in the same scale. In this case, you "world" won't be at the right scale.
Recommended: redo the intrinsic calibration with the right scale, something like:
objp[:,:2] = (size_of_a_square*np.mgrid[0:7,0:6]).T.reshape(-1,2)
Where size_of_a_square is the real size of a square.
(Haven't done this, but is theoretically possible, do it if you can't do 2) Reuse the intrinsic calibration by scaling fx and fy. This is possible because the camera sees everything up to a scale factor, and the declared size of a square only changes fx and fy (and the T in the pose for each square, but that's another story). If the actual size of a square is L, then replace fx and fy Lfx and Lfy before calling solvePnP.
So I have been tinkering a little bit with opencv and I want to be able to use a camera image to get the position of certain objects that are lying flat on plane. These objects are simple shapes such as circles squares etc. They all have the same height of 5cm. To be able to relate real world points to pixels on the camera I painted 4 white squares on the plane with known distances between them.
So the steps I have been taking are:
Initialization:
Calibrate my camera using a checkerboard image and save the calibration data.
Get the input image. call cv::undistort with the calibration data for my camera.
Find the center points of the 4 squares in the image and pass that data and the real world coordinates of the squares to the cv::solvePnP function. Save the rvec and tvec return parameters.
Warp the perspective of the image so you can get a top down view from the image. This is essentially following this tutorial: https://docs.opencv.org/3.4.1/d9/dab/tutorial_homography.html
Use the resulting image to again find the 4 white squares and then calculate a "pixels per meter" translation constant which can relate a certain amount of difference in pixels between points to the real world distance on the plane where the 4 squares are.
Finding object, This is done after initialization:
Get the input image. call cv::undistort with the calibration data for my camera.
Warp the perspective of the image so you can get a top down view from the image. This is the same as step 4 during initialisation.
Find the centerpoint of the object to detect.
Since the centerpoint of the object is on a higher plane then where I calibrated I use the following formula to correct this(d = is the pixel offset from the center of the image. camHeight is the cameraHeight I measured by using a tape measure. h is height of the object):
d = x - (h * (x / camHeight))
So here for an illustration how I got this formule:
But still the coordinates are not matching up...
So I am wondering at all if this is the correct. Specifically I have the following questions:
Is using cv::undistort before using cv::solvenPnP correct? cv::solvePnP also takes the camera calibration data as input so I'm not sure if I have to pass an undistorted image to it or not.
Similar to 1. During Finding object I call cv::undistort -> cv::warpPerspective. Is this undistort necessary here?
Is my calculation to correct for the parallel planes in step 4 correct? I feel like I am missing something but I can't see what. One thing I am wondering is whether I can get the camera height from opencv once solvePnp is done.
I am a newbie to CV so If anything else is totally wrong please also point it out to me.
Thank you for reading this wall of text!
I read the whole documentation on all ARKit classes up and down. I don't see any place that describes ability to actually get the user face's Texture.
ARFaceAnchor contains the ARFaceGeometry (topology and geometry comprised of vertices) and the BlendShapeLocation array (coordinates allowing manipulations of individual facial traits by manipulating geometric math on the user face's vertices).
But where can I get the actual Texture of the user's face. For example: the actual skin tone / color / texture, facial hair, other unique traits, such as scars or birth marks? Or is this not possible at all?
You want a texture-map-style image for the face? There’s no API that gets you exactly that, but all the information you need is there:
ARFrame.capturedImage gets you the camera image.
ARFaceGeometry gets you a 3D mesh of the face.
ARAnchor and ARCamera together tell you where the face is in relation to the camera, and how the camera relates to the image pixels.
So it’s entirely possible to texture the face model using the current video frame image. For each vertex in the mesh...
Convert the vertex position from model space to camera space (use the anchor’s transform)
Multiply with the camera projection with that vector to get to normalized image coordinates
Divide by image width/height to get pixel coordinates
This gets you texture coordinates for each vertex, which you can then use to texture the mesh using the camera image. You could do this math either all at once to replace the texture coordinate buffer ARFaceGeometry provides, or do it in shader code on the GPU during rendering. (If you’re rendering using SceneKit / ARSCNView you can probably do this in a shader modifier for the geometry entry point.)
If instead you want to know for each pixel in the camera image what part of the face geometry it corresponds to, it’s a bit harder. You can’t just reverse the above math because you’re missing a depth value for each pixel... but if you don’t need to map every pixel, SceneKit hit testing is an easy way to get geometry for individual pixels.
If what you’re actually asking for is landmark recognition — e.g. where in the camera image are the eyes, nose, beard, etc — there’s no API in ARKit for that. The Vision framework might help.
I've put together a demo iOS app that shows how to accomplish this. The demo captures a face texture map in realtime, applying it back to a ARSCNFaceGeometry to create a textured 3D model of the user's face.
Below you can see the realtime textured 3D face model in the top left, overlaid on top of the AR front facing camera view:
The demo works by rendering an ARSCNFaceGeometry, however instead of rendering it normally, you instead render it in texture space while continuing to use the original vertex positions to determine where to sample from in the captured pixel data.
Here are links to the relevant parts of the implementation:
FaceTextureGenerator.swift — The main class for generating face textures. This sets up a Metal render pipeline to generate the texture.
faceTexture.metal — The vertex and fragment shaders used to generate the face texture. These operate in texture space.
Almost all the work is done in a metal render pass, so it easily runs in realtime.
I've also put together some notes covering the limitations of the demo
If you instead want a 2D image of the user's face, you can try doing the following:
Render the transformed ARSCNFaceGeometry to a 1-bit buffer to create an image mask. Basically you just want places where the face model appears to be white, while everything else should be black.
Apply the mask to the captured frame image.
This should give you an image with just the face (although you will likely need to crop the result)
You can calculate the texture coordinates as follows:
let geometry = faceAnchor.geometry
let vertices = geometry.vertices
let size = arFrame.camera.imageResolution
let camera = arFrame.camera
modelMatrix = faceAnchor.transform
let textureCoordinates = vertices.map { vertex -> vector_float2 in
let vertex4 = vector_float4(vertex.x, vertex.y, vertex.z, 1)
let world_vertex4 = simd_mul(modelMatrix!, vertex4)
let world_vector3 = simd_float3(x: world_vertex4.x, y: world_vertex4.y, z: world_vertex4.z)
let pt = camera.projectPoint(world_vector3,
orientation: .portrait,
viewportSize: CGSize(
width: CGFloat(size.height),
height: CGFloat(size.width)))
let v = 1.0 - Float(pt.x) / Float(size.height)
let u = Float(pt.y) / Float(size.width)
return vector_float2(u, v)
}
I've studied the example here:
http://learningwebgl.com/lessons/lesson01/index.html
The matrix library in use is:
http://glmatrix.net/
So the example is in clip space and I understand that. I would like to set up an orthographic projection using this example as my base. The Orthographic projection should have it's center at 0,0,0 with an eye location somewhere at +Z and looking down on 0,0,0. When I enter coordinates for my 3d faces in the buffers I would like to be able to enter those coordinates in model space units. Example for my mapping exhibit I have an area of 10,000 cubics x -5000 to +5000 y -5000 to +5000 and z -5000 to +5000 that is to be projected onto the canvas which is 500 x 500. So my 3d faces will have coordinates someplace within those 10,000 cubics and the 500 x 500 canvas will display all 10,000 cubics.
This is the same projection that a CAD program would use to start a scratch drawing. Does anyone know how to do this in WebGL with the glMatrix library? I am new to WebGL and I really could use some guidance on this topic.
For sake of simplicity you first should separate the camera transform matrix from it's projection matrix. You can then multiply them to get the "view-projection matrix" which transform world-space coordinates to screen-space.
var cam = mat4.create();
var proj = mat4.create();
Start placing your camera (cam matrix)
mat4.translate( cam, cam, position )
mat4.rotate( ... )
mat4.lookAt( .. )
//...
Setup othographic projection (proj matrix). You can see the ortho projection like a box aligned with the camera, you can expand each sides with the six parameters. Everything inside the box will show on screen.
var ratio = screenWidth/screenHeight;
var halfWorldWidth = 5000.0;
// the near/far will depend on your camera position
mat4.ortho( proj,
-halfWorldWidth,
halfWorldWidth,
-halfWorldWidth / ratio,
halfWorldWidth /ratio,
-50,
50
)
Finally get the view-projection
var view = mat4.create()
var viewProj = mat4.create()
mat4.invert( view, cam );
mat4.multiply( viewProj, lens, view );