I am trying to build a program that detects offside situations in a football video sequence. In order to better track the players and the ball, I need to estimate the homography between consecutive frames. I am doing this project in Matlab.
I am able to find enough corresponding lines between frames but it seems to me that the resulting homography isn't correct.
I start from the following situation, where I have these two processed images (1280x720 px) with corresponding lines:
image 1 and image 2.
Lines come from the Hough transform and are of the form cross(P1, P2), where P(i) is [x y 1]', with 0 < x,y < 1 (divided by the image width and height). The lines are also normalized, i.e. divided by their third component.
Before line normalization (just after the cross product) I have:
Lines from frame 1 (one line per row).
[ -0.9986 -0.2992 0.6792
-0.9986 -0.4305 0.5686
-0.8000 -0.4500 0.3613
-0.9986 -0.1609 0.7890
-0.9986 -0.0344 0.9074
-0.2500 -0.2164 0.0546]
These are lines from frame 2:
[-0.9986 -0.2984 0.6760
-0.9986 -0.4313 0.5678
-0.7903 -0.4523 0.3587
-0.9986 -0.1609 0.7890
-0.9986 -0.0391 0.9066
-0.2486 -0.2148 0.0539]
After normalization, for each matching line (in this case all rows correspond) I create the matrix A(j)
[-u 0 u*x -v 0 v*x -1 0 x];
[0 -u u*y 0 -v v*y 0 -1 y];
where line(j)_1 is [x y 1]' and line(j)_2 is [u v 1]'. Then I stack all the A(j) into the full matrix A and compute its SVD:
[~,~,V] = svd(A);
Rearranging the last column of V into a 3x3 matrix gives H:
[0.4234 0.0024 -0.3962
-0.3750 -0.0030 0.3503
0.4622 0.0029 -0.4322]
This homography matrix works quite well for the parallel lines above and for the vanishing point (the intersection of those lines), but it does a terrible job elsewhere. For example, one vanishing point is, in unscaled coordinates, (1194.2, -607.4); it is supposed to stay there, and indeed it is mapped to within a few pixels of it (5~10 px). But a random point such as (300, 300) gets mapped to (1174.1, -582.7)!
I can't tell whether I made some big mistake or whether it is due to noise in the measurements. Can you help me?
Well, you computed a homography mapping lines to lines. If you want the corresponding pointwise homography you need to invert and transpose it. See, for example, Chapter 1.3.1 of Hartley and Zisserman's "Multiple View Geometry".
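In NumPy terms (in Matlab the equivalent is inv(H)'), a minimal sketch with placeholder values:
import numpy as np

# H_lines: the 3x3 homography you estimated from line correspondences (placeholder values here)
H_lines = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [1e-3, 0.0, 1.0]])

# lines transform with the inverse transpose of the point homography,
# so the pointwise homography is the inverse transpose of the line homography
H_points = np.linalg.inv(H_lines).T

# apply it to a pixel: homogenize, multiply, dehomogenize
p = H_points @ np.array([300.0, 300.0, 1.0])
print(p[:2] / p[2])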
From the images you posted, it looks like the lines you are considering are all parallel to each other in the scene. In that case the problem is ill-posed, because there is an infinite number of homographies that explain the resulting correspondences. Try adding correspondences for lines with other directions.
Using 6 RGB cameras, I'm filming a scene that I want to reconstruct in 3D, kind of like in the following picture. I forgot to bring a calibration chessboard, so I used a blank rectangular board instead and filmed it as I would film a regular chessboard.
First step, calibration --> OK.
I obviously couldn't use cv2.findChessboardCorners, so I made a small program that allows me to click and store the location of each of the 4 corners. I calibrated from these 4 points for about 10-15 frames as a test.
Tl;Dr: It seemed to work great.
Next step, triangulation. --> NOT OK
I use direct linear transform (DLT) to triangulate my points from all 6 cameras.
Tl;Dr: It's not working so well.
Image and world coordinates are connected by the projection x = PQ (up to scale) for each camera, which can be written as a homogeneous linear system AQ = 0. A singular value decomposition (SVD) of A gives Q as the right singular vector associated with the smallest singular value.
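In code, my triangulation step is roughly the following (a simplified sketch rather than my exact implementation; Ps holds the six 3x4 projection matrices and pts the matching pixel coordinates of one point):
import numpy as np

def triangulate_dlt(Ps, pts):
    # each camera contributes two rows of the homogeneous system A Q = 0
    A = []
    for P, (u, v) in zip(Ps, pts):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    A = np.asarray(A)
    # Q is the right singular vector associated with the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    Q = Vt[-1]
    return Q[:3] / Q[3]   # back to 3D coordinates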
3 of the 4 points are correctly triangulated, but the blue one that should lie on the origin has a wrong x coordinate.
WHY?
Why only one point, and why only the x coordinate?
Does it have anything to do with the fact that I calibrate from a 4 points board?
If so, can you explain; and if not, what else could it be?
Update: I tried another frame where the board is somewhere else, and the triangulation is fine.
So there is the mystery: some points are randomly triangulated wrong (or at least the one at the origin), while most of the others are fine. Again, why?
My guess is that it comes from the triangulation rather than from the calibration, and that there is no connection with my sloppy calibration process.
One common issue I came across is the ambiguity in the solutions found by DLT. Indeed, solving AQ = 0 or solving (AC)(C⁻¹Q) = 0 gives the same solution. See page 46 here. But I don't know what to do about it.
I'm now fairly sure this is not a calibration issue but I don't want to delete this part of my post.
I used ret, K, D, R, T = cv2.calibrateCamera(objpoints, imgpoints, imSize, None, None). It worked seamlessly, and the points were perfectly reprojected on my original image with cv2.projectPoints(objpoints, R, T, K, D).
I computed my projection matrix as P = K [R|T], with R, _ = cv2.Rodrigues(R).
How is it that I get a solution while I have only 4 points per image? Wouldn't I need at least 6 of them? We have x = PX. We can solve P by SVD under the form Ap = 0, where p holds the entries of P. That is 2 equations per point, for 11 independent unknown parameters of P. So 4 points give 8 equations, which shouldn't be enough. And yet cv2.calibrateCamera still gives a solution. It must be using another method? I came across Perspective-n-Point (PnP); is that what OpenCV uses? In which case, is it directly optimizing K, R, and T and thus needs fewer points? I could artificially add a few points to get more than the 4 corner points of my board (for example, the centers of the edges, or the center of the rectangle). But is that really the issue?
When calibrating, one needs to decompose the projection matrix into intrinsic and extrinsic matrices. But this decomposition is not unique and has 4 solutions. See the section 'I'm seeing double' there and Chapter 21 of Hartley & Zisserman about cheirality for more information. This is not my issue, since my camera points are correctly reprojected to the image plane and my cameras are correctly placed in my 3D scene.
I did not quite understand what you are asking; it is rather vague. However, I think you are miscalculating your projection matrix.
If I'm not mistaken, you define 4 3D points representing your rectangle in real-world space, for example like this:
pt_3D = [[0, 0, 0],
         [0, 1, 0],
         [1, 1, 0],
         [1, 0, 0]]
You then retrieve the corresponding 2D points (in order) from each image and generate two lists as follows:
objpoints = [pt_3D, pt_3D, ....] # N times
imgpoints = [pt_2D_img1, pt_2D_img2, ....] # N times ( N images )
You can then calibrate your camera and recover the camera poses as well as the projection matrices as follows:
import numpy as np
import cv2

ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(objpoints, imgpoints, imSize, None, None)

for rvec, tvec in zip(rvecs, tvecs):
    # sanity check: reproject the 3D points with this view's pose
    reprojected, _ = cv2.projectPoints(np.float32(pt_3D), rvec, tvec, mtx, dist)

    Rt, _ = cv2.Rodrigues(rvec)                    # world -> camera rotation
    R = Rt.T                                       # camera -> world rotation
    T = -Rt.T @ tvec                               # camera centre in world coordinates
    pose_Matrix = np.vstack((np.hstack((R, T)), [0, 0, 0, 1]))  # 4x4 camera pose (camera -> world)
    # projection matrix P = K [R|t]: invert the pose to map world -> image
    Projection_Matrix = mtx @ np.linalg.inv(pose_Matrix)[:3, :4]
You don't have to apply the DLT or triangulation yourself: it is all done inside the cv2.calibrateCamera() function, and the 3D points remain the ones you defined yourself.
I have a question regarding the meaning of the elements of a projective transformation matrix, e.g. in a homography used by OpenCV's warpPerspective.
I know the basics of an affine transformation, but here I'm more interested in the projective part, meaning the elements A31 and A32 in the matrix shown below:
A11 A12 A13
A21 A22 A23
A31 A32 1
I played around with the values a bit, keeping fixed numbers for all the other elements:
1 0 0
0 1 0
A31 A32 1
to have just the projective elements.
But what exactly do the elements A31 and A32 cause? For example, A13 and A23 are responsible for the horizontal and vertical translation.
Is there a simple explanation for these two elements? Something like: a positive value means ...., a negative value means .... ?
I hope someone can help me.
Newton's descriptions are correct, but it might be helpful to actually see the transformations to understand what's going on, and how they might work together with other values in the transformation matrix to make a bit more sense. I'll give some python/OpenCV examples with animations to show what these values do.
import numpy as np
import cv2
img = cv2.imread('img1.png')
h, w = img.shape[:2]
# initializations
max_m20 = 2e-3
nsteps = 50
M = np.eye(3)
So here I'm setting the transformation matrix to be the identity (no transformation). We want to see the effect of changing the element at (2, 0) in the transformation matrix M, so we'll animate by looping through nsteps linearly spaced between 0 to max_m20.
for m20 in np.linspace(0, max_m20, nsteps):
    M[2, 0] = m20
    warped = cv2.warpPerspective(img, M, (w, h))
    cv2.imshow('warped', warped)
    k = cv2.waitKey(1)
    if k & 0xFF == ord('q'):
        break
I applied this on an image taken from Oxford's Visual Geometry Group.
So indeed, we can see that this is similar to either rotating your camera around a point that is aligned with the left edge of the image, or rotating the image itself around an axis. However, it is a little different from that. Note that the top edge stays along the top the whole time, which is a little strange. If we really rotated around an axis like above, we would expect the top edge to start coming down on the right side too. Like this:
Well, if you're thinking about transformations, one easy way to get this transformation is to take the transformation above and add some skew distortion, so that the top right side is pushed down as the bottom right corner is pushed up. And that's actually exactly how this view was created:
M = np.eye(3)
max_m20 = 2e-3
max_m10 = 0.6
for m20, m10 in zip(np.linspace(0, max_m20, nsteps), np.linspace(0, max_m10, nsteps)):
    M[2, 0] = m20
    M[1, 0] = m10
    warped = cv2.warpPerspective(img, M, (w, h))
    cv2.imshow('warped', warped)
    k = cv2.waitKey(1)
    if k & 0xFF == ord('q'):
        break
So the right way to think about the perspective in these matrices is, IMO, with the skew entries and the last row together. Those are the two places in the homography matrix where angles actually get modified*; otherwise, it's just rotation, scaling, and translation---all of which are angle preserving.
*Note: Actually, angles can be changed in one more way that I didn't mention. Affine transformations allow for non-uniform scaling, which means you can stretch a shape in width and not in height or vice-versa, which would also change the angles. Imagine if you had a triangle and stretched it only in width; the angles would change. So it turns out that non-uniform scaling (i.e. when the first and middle element of the transformation matrix are different values) can also modify angles in addition to the perspective change and shearing distortions.
Note that in these examples, the same applies to the second entry in the last row, just with the other skew location; the only difference is that it happens at the top instead of the left side. Negative values in both cases are akin to rotating the plane about that axis towards the camera instead of away from it.
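For completeness, animating the other entry is just the loop above with the index swapped (a variant of the snippet above following the same pattern):
M = np.eye(3)
max_m21 = 2e-3
for m21 in np.linspace(0, max_m21, nsteps):
    M[2, 1] = m21                                  # vary the second entry of the last row
    warped = cv2.warpPerspective(img, M, (w, h))   # the tilt now happens along the top edge
    cv2.imshow('warped', warped)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break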
The (3,1) and (3,2) elements of the homography matrix change the plane of the image; that's the difference between affine and homography matrices. For instance, consider this: A31 changes the plane of your image along the left edge. It's like sticking your image to a pole like a flag and rotating it. A positive value rotates clockwise and a negative value the reverse. The other element does the same from the top edge. But together, they set a plane for your image. That's the simplest way I could put it.
I have 96x96 pixel grayscale facial images. I am trying to find the eye centers and lip corners. I applied one Gabor filter (theta=pi/2, lambda=1.50) to a facial image, and after convolving I get a filter output like this.
As you can see from the Gabor output, eyes and mouth corners are clearly distinguishable. I apply scikit-learn KMeans clustering to group the pixels into 4 clusters (2 eyes and 2 lip corners).
import numpy as np
from sklearn.cluster import KMeans

# each row is one flattened 96x96 Gabor response (a 9216-dimensional sample)
data = output.reshape(-1, 96*96)
estimator = KMeans(n_clusters=4)
estimator.fit(data)
centroids = np.asarray(estimator.cluster_centers_)
print('Cluster centers', centroids.shape)
print('Labels', estimator.labels_, estimator.labels_.shape)
Output
Input X,y: (100, 96, 96) (1783, 1)
Gabor Filters (1, 9, 9)
Final output X,y (100, 96, 96) (0,)
Shape estimator.cluster_centers_: (4, 9216)
Now comes the question: how do I plot the x,y coordinates of the 4 cluster centers? Will I see the eye centers and mouth corners?
Further information: I plotted estimator.cluster_centers_ and the output looks like a code book; I see no coordinates of cluster centroids.
I am using the steps as described in this paper: http://jyxy.tju.edu.cn/Precision/MOEMS/doc/p36.pdf
I think there's some confusion here about the space in which you're doing your K-means clustering. In the code snippet that you included in your question, you're training up a KMeans model using the vectorized face images as data points. K-means clusters live in the same space as the data you give it, so (as you noticed) your cluster centroids will also be vectorized face images. Importantly, these face images have dimension 9216, not dimension 2 (i.e., x-y coordinates)!
To get 2-dimensional (x, y) coordinates as your K-means centroids, you need to run the algorithm using 2-dimensional input data. Just off the top of my head, it seems like you could apply a darkness threshold to your face images and assemble a clustering dataset of only the dark pixel locations. Then after you run K-means on this dataset, the centroids will hopefully be close to the pixel locations in your face images where there are the most dark pixels. These locations (assuming the face images in your training data were already somewhat registered) should be somewhat close to the eyes and mouth corners that you're hoping for.
This can be really confusing so I'll try to add an example. Let's say just for an example you have "face images" that are 3 pixels wide by 4 pixels tall. After thresholding the pixels in one of your images it might look like:
0 1 2      <-- x coordinates
0 0 0   0  ^ y coordinates
0 1 0   1  |
1 0 0   2  |
0 0 1   3  v
If you use this "image" directly in K-means, you're really running your K-means algorithm in a 12-dimensional space, and the image above would be vectorized as:
0 0 0 0 1 0 1 0 0 0 0 1
Then your K-means cluster centroids will also live in this same 12-dimensional space.
What I'm trying to suggest is that you could extract the (x, y) coordinates of the 1s in each image, and use those as the data for your K-means algorithm. So for the example image above, you'd get the following data points:
1 1
0 2
2 3
In this example, we've extracted 3 2-dimensional points from this one "image"; with more images you'd get more 2-dimensional points. After you run K-means with these 2-dimensional data points, you'll get cluster centroids that can also be interpreted as pixel locations in the original images. You could plot those centroid locations on top of your images and see where they correspond in the images.
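To make that concrete, here is a minimal sketch of the suggestion above with scikit-learn (the random placeholder for the Gabor output and the 2% threshold are just illustrative):
import numpy as np
from sklearn.cluster import KMeans

# stand-in for the 96x96 Gabor response of one face image (so the snippet runs on its own)
output = np.random.rand(96, 96)

# keep only the darkest pixels; the 2% threshold is illustrative
mask = output < np.percentile(output, 2)

# the (x, y) coordinates of those pixels are the 2-dimensional points to cluster
ys, xs = np.nonzero(mask)
points = np.column_stack((xs, ys)).astype(float)

estimator = KMeans(n_clusters=4).fit(points)

# each centroid is now an (x, y) pixel location you can plot on top of the image
print(estimator.cluster_centers_)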
I need an inverse perspective transform written in Pascal/Delphi/Lazarus. See the following image:
I think I need to walk through the destination pixels and then calculate the corresponding position in the source image (to avoid problems with rounding errors, etc.).
function redraw_3d_to_2d(sourcebitmap: tbitmap; sourceaspect: extended; point_a, point_b, point_c, point_d: tpoint; megapixelcount: integer): tbitmap;
var
  destinationbitmap: tbitmap;
  x, y, sx, sy: integer;
begin
  destinationbitmap := tbitmap.create;
  destinationbitmap.width := megapixelcount*sourceaspect*???;  // I don't know how to calculate this
  destinationbitmap.height := megapixelcount*sourceaspect*???; // I don't know how to calculate this
  for x := 0 to destinationbitmap.width-1 do
    for y := 0 to destinationbitmap.height-1 do
    begin
      sx := ??;
      sy := ??;
      destinationbitmap.canvas.pixels[x, y] := sourcebitmap.canvas.pixels[sx, sy];
    end;
  result := destinationbitmap;
end;
I need the real formula... So an OpenGL solution would not be ideal...
Note: There is a version of this with proper math typesetting on the Math SE.
Computing a projective transformation
A perspective is a special case of a projective transformation, which in turn is defined by four points.
Step 1: Starting with the 4 positions in the source image, named (x1,y1) through (x4,y4), you solve the following system of linear equations:
[x1 x2 x3] [λ] [x4]
[y1 y2 y3]∙[μ] = [y4]
[ 1 1 1] [τ] [ 1]
The columns form homogeneous coordinates: one dimension more, created by adding a 1 as the last entry. In subsequent steps, multiples of these vectors will be used to denote the same points. See the last step for an example of how to turn these back into two-dimensional coordinates.
Step 2: Scale the columns by the coefficients you just computed:
[λ∙x1 μ∙x2 τ∙x3]
A = [λ∙y1 μ∙y2 τ∙y3]
[λ μ τ ]
This matrix will map (1,0,0) to a multiple of (x1,y1,1), (0,1,0) to a multiple of (x2,y2,1), (0,0,1) to a multiple of (x3,y3,1) and (1,1,1) to (x4,y4,1). So it will map these four special vectors (called basis vectors in subsequent explanations) to the specified positions in the image.
Step 3: Repeat steps 1 and 2 for the corresponding positions in the destination image, in order to obtain a second matrix called B.
This is a map from basis vectors to destination positions.
Step 4: Invert B to obtain B⁻¹.
B maps from basis vectors to the destination positions, so the inverse matrix maps in the reverse direction.
Step 5: Compute the combined Matrix C = A∙B⁻¹.
B⁻¹ maps from destination positions to basis vectors, while A maps from there to source positions. So the combination maps destination positions to source positions.
Step 6: For every pixel (x,y) of the destination image, compute the product
[x'] [x]
[y'] = C∙[y]
[z'] [1]
These are the homogeneous coordinates of your transformed point.
Step 7: Compute the position in the source image like this:
sx = x'/z'
sy = y'/z'
This is called dehomogenization of the coordinate vector.
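For concreteness, here is a minimal NumPy sketch of steps 1 through 7 (you asked for Pascal, so treat this as reference code to translate; the corner coordinates at the end are made up):
import numpy as np

def basis_to_points(pts):
    # steps 1-2: matrix taking the basis vectors to the four given (x, y) points
    pts = np.asarray(pts, dtype=float)
    M = np.vstack([pts[:3].T, np.ones(3)])               # columns are (xi, yi, 1), i = 1..3
    coeffs = np.linalg.solve(M, np.append(pts[3], 1.0))  # lambda, mu, tau
    return M * coeffs                                    # scale each column

def destination_to_source(src_pts, dst_pts):
    # steps 3-5: C maps destination pixel coordinates to source coordinates
    A = basis_to_points(src_pts)
    B = basis_to_points(dst_pts)
    return A @ np.linalg.inv(B)

def map_pixel(C, x, y):
    # steps 6-7: apply C and dehomogenize
    xp, yp, zp = C @ np.array([x, y, 1.0])
    return xp / zp, yp / zp

# made-up example: quadrilateral corners in the source image and the output rectangle
width, height = 640, 480
src = [(120, 80), (520, 60), (560, 420), (90, 380)]
dst = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
C = destination_to_source(src, dst)
sx, sy = map_pixel(C, 10, 20)   # source position for destination pixel (10, 20)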
All this math would be so much easier to read and write if SO were to support MathJax… ☹
Choosing the image size
The above approach assumes that you know the location of your corners in the destination image. For these you have to know the width and height of that image, which are marked by question marks in your code as well. So let's assume the height of your output image is 1, and the width is sourceaspect. In that case, the overall area is sourceaspect as well. You have to scale that area by a factor of pixelcount/sourceaspect to achieve an area of pixelcount, which means that you have to scale each edge length by the square root of that factor. So in the end, you have
pixelcount = 1000000.*megapixelcount;
width = round(sqrt(pixelcount*sourceaspect));
height = round(sqrt(pixelcount/sourceaspect));
Use Graphics32, specifically TProjectiveTransformation (to use with the Transform method). Don't forget to leave some transparent margin in your source image so you don't get jagged edges.
(This is a follow-up from this previous question).
I was able to successfully use OpenCV / Hough transforms to detect lines in pictures (scanned text); at first it would detect many many lines (at least one line per line of text), but by adjusting the 'threshold' parameter via trial-and-error, it now only detects "real" lines.
(The 'threshold' parameter is dependent on image size, which is a bit of a problem if one has to deal with images of different resolutions, but that's another story.)
My problem is that the Hough transform sometimes detects two lines where there is only one; those two lines are very near one another and (apparently) parallel.
=> How can I identify that two lines are almost parallel and very near one another? (so that I can keep only one).
If you use the standard or multiscale Hough transform, you will end up with the rho and theta coordinates of the lines in polar form. Rho is the distance to the origin, and theta is normally the angle between the detected line and the Y axis. Without looking into the details of the Hough transform in OpenCV, this is a general rule in those coordinates: two lines will be almost parallel and very near one another when:
- their thetas are nearly identical AND their rhos are nearly identical
OR
- their thetas are near 180 degrees apart AND their rhos are near each other's negative
I hope that makes sense.
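As a rough sketch of that rule in code (cv2.HoughLines returns such (rho, theta) pairs; the tolerances below are arbitrary and need tuning to your images):
import numpy as np

def nearly_same_line(l1, l2, rho_tol=10.0, theta_tol=np.radians(5)):
    # l1 and l2 are (rho, theta) pairs from the Hough transform
    rho1, theta1 = l1
    rho2, theta2 = l2
    dtheta = abs(theta1 - theta2)
    # nearly identical thetas and nearly identical rhos
    if dtheta < theta_tol and abs(rho1 - rho2) < rho_tol:
        return True
    # thetas about 180 degrees apart and rhos near each other's negative
    if abs(dtheta - np.pi) < theta_tol and abs(rho1 + rho2) < rho_tol:
        return True
    return False

def deduplicate(lines, **tols):
    # keep one representative of each group of near-duplicates
    kept = []
    for line in lines:
        if not any(nearly_same_line(line, k, **tols) for k in kept):
            kept.append(line)
    return kept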
That's interesting about the theta being the angle between the line and the y-axis.
Generally, the rho and theta values are visualized as being the angle from the x-axis to the line perpendicular to the line in question. The rho is then the length of this perpendicular line. Thus, a theta = 90 and rho = 20 would mean a horizontal line 20 pixels up from the origin.
A nice image is shown in this Hough Transform question.
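In code, that interpretation lets you turn a (rho, theta) pair back into two points on the line for drawing (a small illustrative sketch):
import numpy as np

def line_points(rho, theta, length=1000):
    # foot of the perpendicular from the origin to the line
    x0, y0 = rho * np.cos(theta), rho * np.sin(theta)
    # the line itself runs perpendicular to that direction
    dx, dy = -np.sin(theta), np.cos(theta)
    p1 = (int(x0 + length * dx), int(y0 + length * dy))
    p2 = (int(x0 - length * dx), int(y0 - length * dy))
    return p1, p2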