Apply a transform computed at lower resolution to a higher-resolution object - OpenCV

I'm working on an iOS app that has to compute the transform matrix between consecutive real-time video frames. I'm using OpenCV to compute optical flow and then find the affine matrix.
This process was working perfectly but was a little slow, so I'm now downsampling each frame to half its size before I start processing it. The problem is that I later have to apply the transform to other video frames at the original resolution (double the one for which I computed the matrix).
My question is: how should I apply the transform matrix I computed for a frame at resolution X to another frame at resolution 2X? I know I should "scale" the matrix somehow, but I'm not sure how. I've tried multiplying the translation components of the matrix by 2, and this works almost perfectly (although I don't understand why), but depending on the transformation it is sometimes not accurate.
One possible solution is to scale the frame to half its size, apply the transform, and then scale it back to its original size, but this has a performance cost, which is why I'm trying to compute a single matrix I can later use to transform the frame.

If you use a homography with entries H_00 to H_22, then you have to apply a scale factor to H_00 and H_11.
I would recommend another workaround. After tracking or correspondence estimation: if x0[n], y0[n] are your start points and x1[n], y1[n] your end points, multiply both by a scale factor and then run findHomography or getAffineTransform. E.g. let w0=200 and h0=100 be the width and height of the frame you estimated the feature correspondences with, and w1=400 and h1=300 the frame you want to apply the transform to. Then sx=2 and sy=3 are the scale factors, and x0[n] = x0[n] * sx, y0[n] = y0[n] * sy, x1[n] = x1[n] * sx, etc.
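To make this concrete, here is a minimal sketch of that workaround in OpenCV C++ (the function and variable names are just illustrative): the matched points from the small frames are scaled up before estimation, so the resulting matrix already works in full-resolution coordinates.

#include <opencv2/opencv.hpp>
#include <vector>

// Estimate a homography valid for the large frames from points tracked on the small frames.
cv::Mat homographyForLargeFrame(std::vector<cv::Point2f> p0,   // start points (small frame)
                                std::vector<cv::Point2f> p1,   // end points (small frame)
                                cv::Size smallSize, cv::Size largeSize)
{
    const float sx = (float)largeSize.width  / smallSize.width;   // e.g. 400/200 = 2
    const float sy = (float)largeSize.height / smallSize.height;  // e.g. 300/100 = 3
    for (auto &p : p0) { p.x *= sx; p.y *= sy; }                  // rescale the correspondences
    for (auto &p : p1) { p.x *= sx; p.y *= sy; }
    return cv::findHomography(p0, p1, cv::RANSAC, 3.0);           // estimated directly at full resolution
}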

Related

How to get the transformation matrix of a 3d model to object in a 2d image

Given an object's 3D mesh file and an image that contains the object, what are some techniques to get the orientation/pose parameters of the 3D object in the image?
I tried searching for some techniques, but most seem to require texture information of the object or at least some additional information. Is there a way to get the pose parameters using just an image and a 3D mesh file (Wavefront .obj)?
Here's an example of a 2D image that can be expected.
FOV of camera
The field of view of the camera is the absolute minimum you need to know to even start with this (how can you determine how to place the object when you have no idea how it would affect the scene?). Basically you need a transform matrix that maps from the world GCS (global coordinate system) to camera/screen space and back. If you do not have a clue what I am writing about, then perhaps you should not try any of this before you learn the math.
For an unknown camera you can do some calibration based on markers or etalons (known size and shape) in the view. But it is much better to use the real camera values (like FOV angles in the x,y directions, focal length, etc.).
The goal of this is to create a function that maps world GCS (x,y,z) into screen LCS (x,y).
For more info read:
transform matrix anatomy
3D graphic pipeline
Perspective projection
Silhouette matching
In order to compare the similarity of the rendered and the real image you need some kind of measure. As you need to match geometry, I think silhouette matching is the way (ignoring textures, shadows and stuff).
So first you need to obtain the silhouettes. Use image segmentation for that and create a ROI mask of your object. For the rendered image this is easy, as you can render the object in a single color without any lighting directly into the ROI mask.
Then you need to construct a function that computes the difference between the silhouettes. You can use any kind of measure, but I think you should start with the non-overlapping area pixel count (it is easy to compute).
Basically you count the pixels that are present in only one ROI (region of interest) mask.
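A minimal sketch of that measure, assuming both silhouettes are already available as 8-bit single-channel masks (non-zero = object); the function name is illustrative:

#include <opencv2/opencv.hpp>

// Count pixels present in exactly one of the two silhouette masks (smaller = better match).
int silhouetteDifference(const cv::Mat &realMask, const cv::Mat &renderedMask)
{
    CV_Assert(realMask.size() == renderedMask.size() && realMask.type() == CV_8UC1);
    cv::Mat onlyInOne;
    cv::bitwise_xor(realMask, renderedMask, onlyInOne);  // non-overlapping area
    return cv::countNonZero(onlyInOne);
}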
estimate position
As you have the mesh, you know its size, so place it in the GCS so that the rendered image has a bounding box very close to the real image's. If you do not have the FOV parameters then you need to rescale and translate each rendered image so it matches the image's bounding box (and as a result you obtain only the orientation, not the position, of the object of course). Cameras have perspective, so the farther from the camera you place your object, the smaller it will be.
fit orientation
Render a few fixed orientations covering all orientations with some step, e.g. 8^3 orientations. For each, compute the silhouette difference and choose the orientation with the smallest difference.
Then fit the orientation angles around it to minimize the difference. If you do not know how optimization or fitting works, see this:
How approximation search works
Beware: too small an amount of initial orientations can cause false positives or missed solutions; too high an amount will be slow.
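A rough sketch of that coarse search, reusing the silhouetteDifference() measure from above; renderSilhouette() is a hypothetical helper that renders the mesh at the given Euler angles into a binary mask:

#include <opencv2/opencv.hpp>
#include <climits>

cv::Mat renderSilhouette(const cv::Vec3d &angles);            // hypothetical: render mesh silhouette at these angles
int silhouetteDifference(const cv::Mat &a, const cv::Mat &b); // measure sketched earlier

// Try steps^3 fixed orientations and keep the one with the smallest silhouette difference.
cv::Vec3d coarseOrientation(const cv::Mat &realMask, int steps = 8)
{
    cv::Vec3d best(0, 0, 0);
    int bestDiff = INT_MAX;
    const double step = 2.0 * CV_PI / steps;
    for (int a = 0; a < steps; a++)
      for (int b = 0; b < steps; b++)
        for (int c = 0; c < steps; c++)
        {
            const cv::Vec3d ang(a * step, b * step, c * step);
            const int diff = silhouetteDifference(realMask, renderSilhouette(ang));
            if (diff < bestDiff) { bestDiff = diff; best = ang; }
        }
    return best;  // refine the angles around this result afterwards
}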
Now that was some basics in a nutshell. As your mesh is not very simple you may need to tweak this, like using contours instead of silhouettes and using the distance between contours instead of the non-overlapping pixel count, which is really hard to compute... You should start with simpler meshes like a die or a coin, and when you have grasped all of this, move to more complex shapes.
[Edit1] algebraic approach
If you know some points in the image that correspond to known 3D points (in your mesh), then along with the FOV of the camera used you can compute the transform matrix placing your object.
If the transform matrix is M (OpenGL style):
M = xx, yx, zx, ox
    xy, yy, zy, oy
    xz, yz, zz, oz
     0,  0,  0,  1
Then any point from your mesh (x,y,z) is transformed to global world (x',y',z') like this:
(x',y',z') = M * (x,y,z)
The pixel position (x'',y'') is obtained by the camera's FOV perspective projection like this:
y''=FOVy*(z'+focus)*y' + ys2;
x''=FOVx*(z'+focus)*x' + xs2;
where camera is at (0,0,-focus), projection plane is at z=0 and viewing direction is +z so for any focal length focus and screen resolution (xs,ys):
xs2=xs*0.5;
ys2=ys*0.5;
FOVx=xs2/focus;
FOVy=ys2/focus;
When you put all this together you obtain this:
xi'' = ( xx*xi + yx*yi + zx*zi + ox ) * ( xz*xi + yz*yi + zz*zi + oz + focus ) * FOVx
yi'' = ( xy*xi + yy*yi + zy*zi + oy ) * ( xz*xi + yz*yi + zz*zi + oz + focus ) * FOVy
where (xi,yi,zi) is the i-th known point's 3D position in mesh local coordinates and (xi'',yi'') is the corresponding known 2D pixel position. So the unknowns are the M values:
{ xx,xy,xz, yx,yy,yz, zx,zy,zz, ox,oy,oz }
So we get 2 equations per known point and 12 unknowns in total, so you need to know 6 points. Solve the system of equations and construct your matrix M.
You can also exploit the fact that M is a uniform orthogonal/orthonormal matrix, so the vectors
X = (xx,xy,xz)
Y = (yx,yy,yz)
Z = (zx,zy,zz)
are perpendicular to each other, so:
(X.Y) = (Y.Z) = (Z.X) = 0.0
This can lower the number of needed points by adding these equations to your system. You can also exploit the cross product: if you know 2 of the vectors, the third can be computed as
Z = (X x Y)*scale
So instead of 3 variables you need just a single scale (which is 1 for an orthonormal matrix). If I assume an orthonormal matrix then:
|X| = |Y| = |Z| = 1
so we get 6 additional equations (3 from the dot products and 3 from the cross product) without any additional unknowns, so 3 points are indeed enough.
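In practice OpenCV's solvePnP solves the same problem (pose from known 3D-2D correspondences plus camera intrinsics), so instead of building the equation system by hand you could sketch it like this (the intrinsic values below are placeholders):

#include <opencv2/opencv.hpp>
#include <vector>

// meshPoints: known 3D points in mesh local coordinates; imagePoints: their known pixel positions.
void estimatePose(const std::vector<cv::Point3f> &meshPoints,
                  const std::vector<cv::Point2f> &imagePoints)
{
    // Pinhole intrinsics: focal lengths and principal point (placeholder values)
    cv::Mat K = (cv::Mat_<double>(3, 3) << 800, 0, 320,
                                           0, 800, 240,
                                           0,   0,   1);
    cv::Mat rvec, tvec;
    cv::solvePnP(meshPoints, imagePoints, K, cv::noArray(), rvec, tvec);

    cv::Mat R;
    cv::Rodrigues(rvec, R);  // 3x3 rotation, i.e. the X,Y,Z axes of M
    // R together with tvec gives the transform placing the mesh in camera space.
}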

Extract face rotation from homography in a video

I'm trying to determine the orientation of a face in a video.
The video starts with a frontal image of the face, so it has no rotation. In the following frames the head rotates and I'm trying to determine the rotation, which will lead me to determine the face orientation based on the camera position.
I'm using OpenCV and C++ for the job.
I'm using SURF descriptors to find points on the face which I use to calculate a homography between the two images. Since the two frames are very close to each other, the head rotation will be minimal in that interval and my homography matrix will be close to the identity matrix.
This is my homography matrix:
H = findHomography(k1,k2,RANSAC,8);
where k1 and k2 are the keypoints extracted with SURF.
I'm using decomposeProjectionMatrix to extract the rotation matrix, but now I'm not sure how to interpret rotMatrix. This one too is basically (1 0 0; 0 1 0; 0 0 1) (where the zeros are numbers on the order of 1e-10 to 1e-16).
In theory, what I was trying to do was to find the angle of the rotation at each frame and store it somewhere, so that if I get a 1° change in each frame, after 10 frames I know that my head has changed its orientation by 10°.
I spent some time reading everything I could find about QR decomposition, homography matrices and so on, but I haven't been able to get around this. Hence, any help would be really appreciated.
Thanks!
The upper-left 2x2 of the homography matrix is a 2D rotation matrix. If you work through the multiplication of the matrix with a point (i.e. take R*p), you'll see it's equivalent to:
newX = oldVector dot firstRow
newY = oldVector dot secondRow
In other words, the first row of the matrix is a unit vector which is the x axis of the new head. (If there's a scale difference between the frames it won't be a unit vector, but this method will still work.) So you should be able to calculate
rotation = atan2(second entry of first row, first entry of first row)
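A minimal sketch of that calculation, assuming H is the CV_64F matrix returned by findHomography above:

#include <opencv2/opencv.hpp>
#include <cmath>

// Rotation angle (radians) of the head between the two frames, per the answer above.
double angleFromHomography(const cv::Mat &H)
{
    // first row of H: (H(0,0), H(0,1)) ~ the new x axis of the head (up to scale)
    return std::atan2(H.at<double>(0, 1), H.at<double>(0, 0));
}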

opencv face detection 2d to 3d

I want to convert the detected face rectangle into 3D coordinates. I have the intrinsic parameters of my webcam and my head dimension, how can I determine the depth Z using the projection equation?
x = fx X / Z + u
y = fy Y / Z + v
I understand that fx, fy and u, v are intrinsic parameters, that X, Y are given by my head dimensions, and that x, y are given by the detected face rectangle. It seems that only one equation is enough to determine Z. How can I use both of them? Or am I wrong?
You are correct that you do not strictly need both of them to compute the depth. However, you may want to use both to improve accuracy.
Another thing to keep in mind is that if your camera is not looking perpendicularly at the planar object (e.g. face) you measured, one or both of the measurements may not be useful to compute the depth. For example if your camera is looking up at a rectangle, only the width will be a good measure for the depth, because the height is compressed by the viewing angle. I don't think this matters for your face detector though, because the proportions of the face are assumed fixed anyway?
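A minimal sketch of the width-based depth estimate, assuming the face rectangle comes from an OpenCV detector and W is a rough real head width in meters (an assumption, e.g. 0.16 m); averaging it with the equivalent height-based estimate is the simple way to use both equations:

#include <opencv2/opencv.hpp>

// From x = fx*X/Z + u the projected width is w = fx*W/Z (u cancels when subtracting
// the two edges of the rectangle), so Z = fx*W / w.
double depthFromFaceWidth(const cv::Rect &face, double fx, double W = 0.16)
{
    return fx * W / face.width;
}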

opencv update homography matrix to fit on an image double the size

I'm doing video stabilization using optical flow. To make calcOpticalFlowPyrLK work faster I'm downscaling the original image 2x and running the function on that.
How can I modify the homography matrix (retrieved via findHomography) to be able to warpPerspective the original, larger image?
This is a little late and the answer you have works fine, but I have one thing to add. I don't like taking functions like getPerspectiveTransform for granted; in this case it is easy to just make the matrix yourself. Image reductions that are powers of 2 are easy. Suppose you have a point and you want to move it to an image with twice the resolution:
float newx = (oldx+.5)*2 - .5;
float newy = (oldy+.5)*2 - .5;
conversely, to go to an image of half the resolution...
float newx = (oldx+.5)/2 - .5;
float newy = (oldy+.5)/2 - .5;
Draw yourself a diagram if you need to and convince yourself it works; remember 0 indexing. Instead of thinking about making your transformation work on other resolutions, think about moving every point to the resolution of your transform, then using your transform, then moving it back. Fortunately you can do all of this in one matrix; we just need to build that matrix! First build a matrix for each of the three steps:
//move point to an image of half resolution, note it is equivalent to the above equation
project_down = ( .5,  0, -.25,
                  0, .5, -.25,
                  0,  0,    1 )
//move point to an image of twice the resolution, these are inverses of one another
project_up   = (  2,  0,  .5,
                  0,  2,  .5,
                  0,  0,   1 )
To make your final transformation just combine them
final_transform = [project_up][your_homography][project_down];
The nice thing is you only have to do this once for any given homography. This should work the same as getPerspectiveTransform (and probably run faster). Hopefully understanding this will help you deal with other questions you may run into regarding image resolution changes.
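For reference, a sketch of that composition in OpenCV C++, assuming H is the 3x3 CV_64F homography estimated on the half-size frames:

#include <opencv2/opencv.hpp>

// Wrap the half-resolution homography so it can warp the full-resolution frame.
cv::Mat homographyForFullSize(const cv::Mat &H)
{
    const cv::Mat project_down = (cv::Mat_<double>(3, 3) << 0.5, 0, -0.25,
                                                            0, 0.5, -0.25,
                                                            0,   0,     1);
    const cv::Mat project_up   = (cv::Mat_<double>(3, 3) << 2, 0, 0.5,
                                                            0, 2, 0.5,
                                                            0, 0,   1);
    return project_up * H * project_down;  // full res -> half res -> H -> back to full res
}
// usage: cv::warpPerspective(frame, stabilized, homographyForFullSize(H), frame.size());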
Let B be the transformation you have computed. You can multiply B by another homography, A, to get AB = C, where C is a homography that does both transformations; this is equivalent to applying first B and then A. To find A you can use getPerspectiveTransform.
Edit: by AB I mean matrix multiplication, not element-wise multiplication.
Edit 2: to get A, pass the four corners of the two images in the same order to getPerspectiveTransform, such that the corners of the downsampled image are the source points and the corners of the original image are the destination points.
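A sketch of that corner-based way to get A, with smallSize being the downsampled frame size and fullSize the original one (names are illustrative):

#include <opencv2/opencv.hpp>

// A maps downsampled coordinates to original coordinates; C = A * B then combines both steps.
cv::Mat upscaleTransform(const cv::Size &smallSize, const cv::Size &fullSize)
{
    const cv::Point2f src[4] = { {0, 0},
                                 {(float)smallSize.width, 0},
                                 {(float)smallSize.width, (float)smallSize.height},
                                 {0, (float)smallSize.height} };
    const cv::Point2f dst[4] = { {0, 0},
                                 {(float)fullSize.width, 0},
                                 {(float)fullSize.width, (float)fullSize.height},
                                 {0, (float)fullSize.height} };
    return cv::getPerspectiveTransform(src, dst);
}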

image transformations

So I've been using gnu-gsl and CImg to implement some of the fundamental projective-space techniques for affine and metric rectification.
I've completed computing an affine rectification, but I'm having a hard time figuring out how to apply the affine rectification matrix to the original (input) image.
My current thought process is to iterate across the input image for each pixel coordinate. Then multiply the original pixel coordinate (converted to a homogeneous coordinate) by the affine rectification matrix to get the output pixel coordinate.
Then access the output image using the output pixel coordinate and conduct a blend (addition) operation on the output image's pixel location with the pixel color from the original image.
Does that sound right? I'm getting a lot of really weird values after multiplying the original pixel coordinate by the affine rectification matrix.
No, your values should not be weird. Why don't you make a simple example: a small scale with a small translation, e.g.
x' = 1.01*x + 0.0*y + 5;
y' = 0.0*x + 0.98*y + 10;
Now the pixel at (10,10) should map to (15.1, 19.8), right?
If you want to make a nice output image, you should find the forward projection and then back-project to the input image and interpolate there, rather than trying to blend into the output image. Otherwise you will end up with gaps in the output.
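For example, a backward-mapping sketch with CImg (grayscale and nearest-neighbour lookup for brevity; swap in bilinear interpolation for better quality). Here a..f are assumed to be the coefficients of the inverse rectification, i.e. the mapping from output coordinates back to input coordinates:

#include "CImg.h"
using namespace cimg_library;

CImg<unsigned char> rectify(const CImg<unsigned char> &in,
                            double a, double b, double c,
                            double d, double e, double f)
{
    CImg<unsigned char> out(in.width(), in.height(), 1, 1, 0);
    cimg_forXY(out, x, y)                      // iterate over every *output* pixel...
    {
        const double sx = a * x + b * y + c;   // ...and back-project it into the input image
        const double sy = d * x + e * y + f;
        if (sx >= 0 && sy >= 0 && sx < in.width() && sy < in.height())
            out(x, y) = in((int)(sx + 0.5), (int)(sy + 0.5));
    }
    return out;
}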
You need to be careful with your terminology here; it sounds to me like you are doing projections, sometimes called warping in the computer graphics community. Rectification is something else, but it depends on what you are doing.
