Redraw image from 3d perspective to 2d - delphi

I need an inverse perspective transform written in Pascal/Delphi/Lazarus. See the following image:
I think I need to walk through destination pixels and then calculate the corresponding position in the source image (To avoid problems with rounding errors etc.).
function redraw_3d_to_2d(sourcebitmap:tbitmap, sourceaspect:extended, point_a, point_b, point_c, point_d:tpoint, megapixelcount:integer):tbitmap;
var
destinationbitmap:tbitmap;
x,y,sx,sy:integer;
begin
destinationbitmap:=tbitmap.create;
destinationbitmap.width=megapixelcount*sourceaspect*???; // I dont how to calculate this
destinationbitmap.height=megapixelcount*sourceaspect*???; // I dont how to calculate this
for x:=0 to destinationbitmap.width-1 do
for y:=0 to destinationbitmap.height-1 do
begin
sx:=??;
sy:=??;
destinationbitmap.canvas.pixels[x,y]=sourcebitmap.canvas.pixels[sx,sy];
end;
result:=destinationbitmap;
end;
I need the real formula... So an OpenGL solution would not be ideal...

Note: There is a version of this with proper math typesetting on the Math SE.
Computing a projective transformation
A perspective is a special case of a projective transformation, which in turn is defined by four points.
Step 1: Starting with the 4 positions in the source image, named (x1,y1) through (x4,y4), you solve the following system of linear equations:
[x1 x2 x3] [λ] [x4]
[y1 y2 y3]∙[μ] = [y4]
[ 1 1 1] [τ] [ 1]
The colums form homogenous coordinates: one dimension more, created by adding a 1 as the last entry. In subsequent steps, multiples of these vectors will be used to denote the same points. See the last step for an example of how to turn these back into two-dimensional coordinates.
Step 2: Scale the columns by the coefficients you just computed:
[λ∙x1 μ∙x2 τ∙x3]
A = [λ∙y1 μ∙y2 τ∙y3]
[λ μ τ ]
This matrix will map (1,0,0) to a multiple of (x1,y1,1), (0,1,0) to a multiple of (x2,y2,1), (0,0,1) to a multiple of (x3,y3,1) and (1,1,1) to (x4,y4,1). So it will map these four special vectors (called basis vectors in subsequent explanations) to the specified positions in the image.
Step 3: Repeat steps 1 and 2 for the corresponding positions in the destination image, in order to obtain a second matrix called B.
This is a map from basis vectors to destination positions.
Step 4: Invert B to obtain B⁻¹.
B maps from basis vectors to the destination positions, so the inverse matrix maps in the reverse direction.
Step 5: Compute the combined Matrix C = A∙B⁻¹.
B⁻¹ maps from destination positions to basis vectors, while A maps from there to source positions. So the combination maps destination positions to source positions.
Step 6: For every pixel (x,y) of the destination image, compute the product
[x'] [x]
[y'] = C∙[y]
[z'] [1]
These are the homogenous coordinates of your transformed point.
Step 7: Compute the position in the source image like this:
sx = x'/z'
sy = y'/z'
This is called dehomogenization of the coordinate vector.
All this math would be so much easier to read and write if SO were to support MathJax… ☹
Choosing the image size
The above aproach assumes that you know the location of your corners in the destination image. For these you have to know the width and height of that image, which is marked by question marks in your code as well. So let's assume the height of your output image were 1, and the width were sourceaspect. In that case, the overall area would be sourceaspect as well. You have to scale that area by a factor of pixelcount/sourceaspect to achieve an area of pixelcount. Which means that you have to scale each edge length by the square root of that factor. So in the end, you have
pixelcount = 1000000.*megapixelcount;
width = round(sqrt(pixelcount*sourceaspect));
height = round(sqrt(pixelcount/sourceaspect));

Use Graphics32, specifically TProjectiveTransformation (to use with the Transform method). Don't forget to leave some transparent margin in your source image so you don't get jagged edges.

Related

How to get the transformation matrix of a 3d model to object in a 2d image

Given an object's 3D mesh file and an image that contains the object, what are some techniques to get the orientation/pose parameters of the 3d object in the image?
I tried searching for some techniques, but most seem to require texture information of the object or at least some additional information. Is there a way to get the pose parameters using just an image and a 3d mesh file (wavefront .obj)?
Here's an example of a 2D image that can be expected.
FOV of camera
Field of view of camera is absolute minimum to know to even start with this (how can you determine how to place object when you have no idea how it would affect scene). Basically you need transform matrix that maps from world GCS (global coordinate system) to Camera/Screen space and back. If you do not have a clue what about I am writing then perhaps you should not try any of this before you learn the math.
For unknown camera you can do some calibration based on markers or etalones (known size and shape) in the view. But much better is use real camera values (like FOV angles in x,y direction, focal length etc ...)
The goal for this is to create function that maps world GCS(x,y,z) into Screen LCS(x,y).
For more info read:
transform matrix anatomy
3D graphic pipeline
Perspective projection
Silhouette matching
In order to compare rendered and real image similarity you need some kind of measure. As you need to match geometry I think silhouette matching is the way (ignoring textures, shadows and stuff).
So first you need to obtain silhouettes. Use image segmentation for that and create ROI mask of your object. For rendered image is this easy as you van render the object with single color without any lighting directly into ROI mask.
So you need to construct function that compute the difference between silhouettes. You can use any kind of measure but I think you should start with non overlapping areas pixel count (it is easy to compute).
Basically you count pixels that are present only in one ROI (region of interest) mask.
estimate position
as you got the mesh then you know its size so place it in the GCS so rendered image has very close bounding box to real image. If you do not have FOV parameters then you need to rescale and translate each rendered image so it matches images bounding box (and as result you obtain only orientation not position of object of coarse). Cameras have perspective so the more far from camera you place your object the smaller it will be.
fit orientation
render few fixed orientations covering all orientations with some step 8^3 orientations. For each compute the difference of silhouette and chose orientation with smallest difference.
Then fit the orientation angles around it to minimize difference. If you do not know how optimization or fitting works see this:
How approximation search works
Beware too small amount of initial orientations can cause false positioves or missed solutions. Too high amount will be slow.
Now that was some basics in a nutshell. As your mesh is not very simple you may need to tweak this like use contours instead of silhouettes and using distance between contours instead of non overlapping pixels count which is really hard to compute ... You should start with simpler meshes like dice , coin etc ... and when grasping all of this move to more complex shapes ...
[Edit1] algebraic approach
If you know some points in the image that coresponds to known 3D points (in your mesh) then you can along with the FOV of the camera used compute the transform matrix placing your object ...
if the transform matrix is M (OpenGL style):
M = xx,yx,zx,ox
xy,yy,zy,oy
xz,yz,zz,oz
0, 0, 0, 1
Then any point from your mesh (x,y,z) is transformed to global world (x',y',z') like this:
(x',y',z') = M * (x,y,z)
The pixel position (x'',y'') is done by camera FOV perspective projection like this:
y''=FOVy*(z'+focus)*y' + ys2;
x''=FOVx*(z'+focus)*x' + xs2;
where camera is at (0,0,-focus), projection plane is at z=0 and viewing direction is +z so for any focal length focus and screen resolution (xs,ys):
xs2=xs*0.5;
ys2=ys*0.5;
FOVx=xs2/focus;
FOVy=ys2/focus;
When put all this together you obtain this:
xi'' = ( xx*xi + yx*yi + zx*zi + ox ) * ( xz*xi + yz*yi + zz*zi + ox + focus ) * FOVx
yi'' = ( xy*xi + yy*yi + zy*zi + oy ) * ( xz*xi + yz*yi + zz*zi + oy + focus ) * FOVy
where (xi,yi,zi) is i-th known point 3D position in mesh local coordinates and (xi'',yi'') is corresponding known 2D pixel positions. So unknowns are the M values:
{ xx,xy,xz,yx,yy,yx,zx,zy,zz,ox,oy,oz }
So we got 2 equations per each known point and 12 unknowns total. So you need to know 6 points. Solve the system of equations and construct your matrix M.
Also you can exploit that M is a uniform orthogonal/orthonormal matrix so vectors
X = (xx,xy,xz)
Y = (yx,yy,yz)
Z = (zx,zy,zz)
Are perpendicular to each other so:
(X.Y) = (Y.Z) = (Z.X) = 0.0
Which can lower the number of needed points by introducing these to your system. Also you can exploit cross product so if you know 2 vectors the thirth can be computed
Z = (X x Y)*scale
So instead of 3 variables you need just single scale (which is 1 for orthonormal matrix). If I assume orthonormal matrix then:
|X| = |Y| = |Z| = 1
so we got 6 additional equations (3 x dot, and 3 for cross) without any additional unknowns so 3 point are indeed enough.

Measure distance to object with a single camera in a static scene

let's say I am placing a small object on a flat floor inside a room.
First step: Take a picture of the room floor from a known, static position in the world coordinate system.
Second step: Detect the bottom edge of the object in the image and map the pixel coordinate to the object position in the world coordinate system.
Third step: By using a measuring tape measure the real distance to the object.
I could move the small object, repeat this three steps for every pixel coordinate and create a lookup table (key: pixel coordinate; value: distance). This procedure is accurate enough for my use case. I know that it is problematic if there are multiple objects (an object could cover an other object).
My question: Is there an easier way to create this lookup table? Accidentally changing the camera angle by a few degrees destroys the hard work. ;)
Maybe it is possible to execute the three steps for a few specific pixel coordinates or positions in the world coordinate system and perform some "calibration" to calculate the distances with the computed parameters?
If the floor is flat, its equation is that of a plane, let
a.x + b.y + c.z = 1
in the camera coordinates (the origin is the optical center of the camera, XY forms the focal plane and Z the viewing direction).
Then a ray from the camera center to a point on the image at pixel coordinates (u, v) is given by
(u, v, f).t
where f is the focal length.
The ray hits the plane when
(a.u + b.v + c.f) t = 1,
i.e. at the point
(u, v, f) / (a.u + b.v + c.f)
Finally, the distance from the camera to the point is
p = √(u² + v² + f²) / (a.u + b.v + c.f)
This is the function that you need to tabulate. Assuming that f is known, you can determine the unknown coefficients a, b, c by taking three non-aligned points, measuring the image coordinates (u, v) and the distances, and solving a 3x3 system of linear equations.
From the last equation, you can then estimate the distance for any point of the image.
The focal distance can be measured (in pixels) by looking at a target of known size, at a known distance. By proportionality, the ratio of the distance over the size is f over the length in the image.
Most vision libraries (including opencv) have built in functions that will take a couple points from a camera reference frame and the related points from a Cartesian plane and generate your warp matrix (affine transformation) for you. (some are fancy enough to include non-linearity mappings with enough input points, but that brings you back to your time to calibrate issue)
A final note: most vision libraries use some type of grid to calibrate off of ie a checkerboard patter. If you wrote your calibration to work off of such a sheet, then you would only need to measure distances to 1 target object as the transformations would be calculated by the sheet and the target would just provide the world offsets.
I believe what you are after is called a Projective Transformation. The link below should guide you through exactly what you need.
Demonstration of calculating a projective transformation with proper math typesetting on the Math SE.
Although you can solve this by hand and write that into your code... I strongly recommend using a matrix math library or even writing your own matrix math functions prior to resorting to hand calculating the equations as you will have to solve them symbolically to turn it into code and that will be very expansive and prone to miscalculation.
Here are just a few tips that may help you with clarification (applying it to your problem):
-Your A matrix (source) is built from the 4 xy points in your camera image (pixel locations).
-Your B matrix (destination) is built from your measurements in in the real world.
-For fast recalibration, I suggest marking points on the ground to be able to quickly place the cube at the 4 locations (and subsequently get the altered pixel locations in the camera) without having to remeasure.
-You will only have to do steps 1-5 (once) during calibration, after that whenever you want to know the position of something just get the coordinates in your image and run them through step 6 and step 7.
-You will want your calibration points to be as far away from eachother as possible (within reason, as at extreme distances in a vanishing point situation, you start rapidly losing pixel density and therefore source image accuracy). Make sure that no 3 points are colinear (simply put, make your 4 points approximately square at almost the full span of your camera fov in the real world)
ps I apologize for not writing this out here, but they have fancy math editing and it looks way cleaner!
Final steps to applying this method to this situation:
In order to perform this calibration, you will have to set a global home position (likely easiest to do this arbitrarily on the floor and measure your camera position relative to that point). From this position, you will need to measure your object's distance from this position in both x and y coordinates on the floor. Although a more tightly packed calibration set will give you more error, the easiest solution for this may simply be to have a dimension-ed sheet(I am thinking piece of printer paper or a large board or something). The reason that this will be easier is that it will have built in axes (ie the two sides will be orthogonal and you will just use the four corners of the object and used canned distances in your calibration). EX: for a piece of paper your points would be (0,0), (0,8.5), (11,8.5), (11,0)
So using those points and the pixels you get will create your transform matrix, but that still just gives you a global x,y position on axes that may be hard to measure on (they may be skew depending on how you measured/ calibrated). So you will need to calculate your camera offset:
object in real world coords (from steps above): x1, y1
camera coords (Xc, Yc)
dist = sqrt( pow(x1-Xc,2) + pow(y1-Yc,2) )
If it is too cumbersome to try to measure the position of the camera from global origin by hand, you can instead measure the distance to 2 different points and feed those values into the above equation to calculate your camera offset, which you will then store and use anytime you want to get final distance.
As already mentioned in the previous answers you'll need a projective transformation or simply a homography. However, I'll consider it from a more practical view and will try to summarize it short and simple.
So, given the proper homography you can warp your picture of a plane such that it looks like you took it from above (like here). Even simpler you can transform a pixel coordinate of your image to world coordinates of the plane (the same is done during the warping for each pixel).
A homography is basically a 3x3 matrix and you transform a coordinate by multiplying it with the matrix. You may now think, wait 3x3 matrix and 2D coordinates: You'll need to use homogeneous coordinates.
However, most frameworks and libraries will do this handling for you. What you need to do is finding (at least) four points (x/y-coordinates) on your world plane/floor (preferably the corners of a rectangle, aligned with your desired world coordinate system), take a picture of them, measure the pixel coordinates and pass both to the "find-homography-function" of your desired computer vision or math library.
In OpenCV that would be findHomography, here an example (the method perspectiveTransform then performs the actual transformation).
In Matlab you can use something from here. Make sure you are using a projective transformation as transform type. The result is a projective tform, which can be used in combination with this method, in order to transform your points from one coordinate system to another.
In order to transform into the other direction you just have to invert your homography and use the result instead.

How to transform filter when using FFT to do 2d convolution?

I want to use FFT to accelerate 2D convolution. The filter is 15 x 15 and the image is 300 x 300. The filter's size is different with image so I can not doing dot product after FFT. So how to transform the filter before doing FFT so that its size can be matched with image?
I use the convention that N is kernel size.
Knowing the convolution is not defined (mathematically) on the edges (N//2 at each end of each dimension), you would loose N pixels in totals on each axis.
You need to make room for convolution : pad the image with enough "neutral values" so that the edge cases (junk values inserted there) disappear.
This would involve making your image a 307x307px image (with suitable padding values, see next paragraph), which after convolution gives back a 300x300 image.
Popular image processing libraries have this already embedded : when you ask for a convolution, you have extra arguments specifying the "mode".
Which values can we pad with ?
Stolen with no shame from Numpy's pad documentation
'constant' : Pads with a constant value.
'edge' : Pads with the edge values of array.
'linear_ramp' : Pads with the linear ramp between end_value and the arraydge value.
'maximum' :
Pads with the maximum value of all or part of the
vector along each axis.
'mean'
Pads with the mean value of all or part of the
vector along each axis.
'median'
Pads with the median value of all or part of the
vector along each axis.
'minimum'
Pads with the minimum value of all or part of the
vector along each axis.
'reflect'
Pads with the reflection of the vector mirrored on
the first and last values of the vector along each
axis.
'symmetric'
Pads with the reflection of the vector mirrored
along the edge of the array.
'wrap'
Pads with the wrap of the vector along the axis.
The first values are used to pad the end and the
end values are used to pad the beginning.
It's up to you, really, but the rule of thumb is "choose neutral values for the task at hand".
(For instance, padding with 0 when doing averaging makes little sense, because 0 is not neutral in an average of positive values)
it depends on the algorithm you use for the FFT, because most of them need to work with images of dyadic dimensions (power of 2).
Here is what you have to do:
Padding image: center your image into a bigger one with dyadic dimensions
Padding kernel: center you convolution kernel into an image with same dimensions as step 1.
FFT on the image from step 1
FFT on the kernel from step 2
Complex multiplication (Fourier space) of results from steps 3 and 4.
Inverse FFT on the resulting image on step 5
Unpadding on the resulting image from step 6
Put all 4 blocs into the right order.
If the algorithm you use does not need dyadic dimensions, then steps 1 is useless and 2 has to be a simple padding with the image dimensions.

SIFT matches and recognition?

I am developing an application where I am using SIFT + RANSAC and Homography to find an object (OpenCV C++,Java). The problem I am facing is that where there are many outliers RANSAC performs poorly.
For this reasons I would like to try what the author of SIFT said to be pretty good: voting.
I have read that we should vote in a 4 dimension feature space, where the 4 dimensions are:
Location [x, y] (someone says Traslation)
Scale
Orientation
While with opencv is easy to get the match scale and orientation with:
cv::Keypoints.octave
cv::Keypoints.angle
I am having hard time to understand how I can calculate the location.
I have found an interesting slide where with only one match we are able to draw a bounding box:
But I don't get how I could draw that bounding box with just one match. Any help?
You are looking for the largest set of matched features that fit a geometric transformation from image 1 to image 2. In this case, it is the similarity transformation, which has 4 parameters: translation (dx, dy), scale change ds, and rotation d_theta.
Let's say you have matched to features: f1 from image 1 and f2 from image 2. Let (x1,y1) be the location of f1 in image 1, let s1 be its scale, and let theta1 be it's orientation. Similarly you have (x2,y2), s2, and theta2 for f2.
The translation between two features is (dx,dy) = (x2-x1, y2-y1).
The scale change between two features is ds = s2 / s1.
The rotation between two features is d_theta = theta2 - theta1.
So, dx, dy, ds, and d_theta are the dimensions of your Hough space. Each bin corresponds to a similarity transformation.
Once you have performed Hough voting, and found the maximum bin, that bin gives you a transformation from image 1 to image 2. One thing you can do is take the bounding box of image 1 and transform it using that transformation: apply the corresponding translation, rotation and scaling to the corners of the image. Typically, you pack the parameters into a transformation matrix, and use homogeneous coordinates. This will give you the bounding box in image 2 corresponding to the object you've detected.
When using the Hough transform, you create a signature storing the displacement vectors of every feature from the template centroid (either (w/2,h/2) or with the help of central moments).
E.g. for 10 SIFT features found on the template, their relative positions according to template's centroid is a vector<{a,b}>. Now, let's search for this object in a query image: every SIFT feature found in the query image, matched with one of template's 10, casts a vote to its corresponding centroid.
votemap(feature.x - a*, feature.y - b*)+=1 where a,b corresponds to this particular feature vector.
If some of those features cast successfully at the same point (clustering is essential), you have found an object instance.
Signature and voting are reverse procedures. Let's assume V=(-20,-10). So during searching in the novel image, when the two matches are found, we detect their orientation and size and cast a respective vote. E.g. for the right box centroid will be V'=(+20*0.5*cos(-10),+10*0.5*sin(-10)) away from the SIFT feature because it is in half size and rotated by -10 degrees.
To complete Dima's , one needs to add that the 4D Hough space is quantized into a (possibly small) number of 4D boxes, where each box corresponds to the simiéarity given by its center.
Then, for each possible similarity obtained via a tentative matching of features, add 1 into the corresponding box (or cell) in the 4D space. The output similarity is given by the cell with the more votes.
In order to computethe transform from 1 match, just use Dima's formulas in his answer. For several pairs of matches, you may need to use some least squares fit.
Finally, the transform can be applied with the function cv::warpPerspective(), where the third line of the perspective matrix is set to [0,0,1].

How can I transform an image using matrices R and T (extrinsic parameters matrices) in opencv?

I have a rotation-translation matrix [R T] (3x4).
Is there a function in opencv that performs the rotation-translation described by [R T]?
A lot of solutions to this question I think make hidden assumptions. I will try to give you a quick summary of how I think about this problem (I have had to think about it a lot in the past). Warping between two images is a 2 dimensional process accomplished by a 3x3 matrix called a homography. What you have is a 3x4 matrix which defines a transform in 3 dimensions. You can convert between the two by treating your image as a flat plane in 3 dimensional space. The trick then is to decide on the initial position in world space of your image plane. You can then transform its position and project it onto a new image plane with your camera intrinsics matrix.
The first step is to decide where your initial image lies in world space, note that this does not have to be the same as your initial R and T matrices specify. Those are in world coordinates, we are talking about the image created by that world, all the objects in the image have been flattened into a plane. The simplest decision here is to set the image at a fixed displacement on the z axis and no rotation. From this point on I will assume no rotation. If you would like to see the general case I can provide it but it is slightly more complicated.
Next you define the transform between your two images in 3d space. Since you have both transforms with respect to the same origin, the transform from [A] to [B] is the same as the transform from [A] to your origin, followed by the transform from the origin to [B]. You can get that by
transform = [B]*inverse([A])
Now conceptually what you need to do is to take your first image, project its pixels onto the geometric interpretation of your image in 3d space, then transform those pixels in 3d space by the transform above, then project them back onto a new 2d image with your camera matrix. Those steps need to be combined into a single 3x3 matrix.
cv::Matx33f convert_3x4_to_3x3(cv::Matx34f pose, cv::Matx33f camera_mat, float zpos)
{
//converted condenses the 3x4 matrix which transforms a point in world space
//to a 3x3 matrix which transforms a point in world space. Instead of
//multiplying pose by a 4x1 3d homogeneous vector, by specifying that the
//incoming 3d vectors will ALWAYS have a z coordinate of zpos, one can instead
//multiply converted by a homogeneous 2d vector and get the same output for x and y.
cv::Matx33f converted(pose(0,0),pose(0,1),pose(0,2)*zpos+pose(0,3),
pose(1,0),pose(1,1),pose(1,2)*zpos+pose(1,3),
pose(2,0),pose(2,1),pose(2,2)*zpos+pose(2,3));
//This matrix will take a homogeneous 2d coordinate and "projects" it onto a
//flat plane at zpos. The x and y components of the incoming homogeneous 2d
//coordinate will be correct, the z component is dropped.
cv::Matx33f projected(1,0,0,
0,1,0,
0,0,zpos);
projected = projected*camera_mat.inv();
//now we have the pieces. A matrix which can take an incoming 2d point, and
//convert it into a pseudo 3d point (x and y correspond to 3d, z is unused)
//and a matrix which can take our pseudo 3d point and transform it correctly.
//Now we just need to turn our transformed pseudo 3d point back into a 2d point
//in our new image, to do that simply multiply by the camera matrix.
return camera_mat*converted*projected;
}
This is probably a more complicated answer than you were looking for but I hope it gives you an idea of what you are asking. This can be very confusing and I glazed over some parts of it quickly, feel free to ask for clarification. If you need the solution to work without the assumption that the initial image appears without rotation let me know, I just didn't want to make it more complicated than it needed to be.

Resources