What is the meaning of 'Projection' in Deep Learning? - machine-learning

When I search through deep learning documents or papers, I sometimes come across the term 'projection', but I don't know its definition.
I think that a projection means mapping a vector from one vector space to another.
For example:
# python code
import numpy as np
X = np.zeros([200, 100])
W = np.zeros([100, 300])
XW = np.matmul(X, W)  # XW shape is [200, 300]
At this point, I think X is projected onto XW.
Am I correct?

Sounds like you might be referring to vector projection. I like lights, so imagine sitting directly under a light with your arm resting on a table in front of you. Keeping your elbow on the table, raise your hand until your arm is at an angle into the air; the projected vector is then the shadow of your arm and hand on the table!
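For what it's worth, the shadow analogy is exactly the classical vector projection formula proj_u(v) = (v·u / u·u) u, while the matmul in the question is a linear map between vector spaces - both usages of "projection" exist. A minimal numpy sketch of the former (`project` is a hypothetical helper name, not a library function):

```python
import numpy as np

def project(v, u):
    # The "shadow" of v along the direction of u: (v·u / u·u) * u.
    u, v = np.asarray(u, float), np.asarray(v, float)
    return (v @ u) / (u @ u) * u

# Arm raised at an angle (v); the table is the x-axis (u).
print(project([3.0, 4.0], [1.0, 0.0]))  # → [3. 0.]
```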

Related

Multiple camera view triangulation

Given a scene, multiple camera views of that scene and their corresponding projection matrices, we wish to triangulate 2D matching points into 3D.
What I've been doing so far is solving the system PX = alpha * x, where P is the projection matrix, X is the 3D point in camera coordinates, alpha is a scalar, and x is the vector corresponding to the point in 2D. X and x are in homogeneous coordinates.
See https://tspace.library.utoronto.ca/handle/1807/10437 page 102 for more detail.
Solving this with an SVD yields proper results when the 2D points are accurately selected or when I only use two views. Introducing more views adds a lot of error.
Any advice on what techniques are best to improve/refine this solution and make it support more views?
If I understand correctly, we can view this as finding the point in 3D space that minimizes the sum of orthogonal distances between the point and the lines (one line per camera view)? I guess gradient descent in 3D space could find a local minimum of this.
Did I understand the problem correctly?
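The SVD approach the question describes extends naturally to N views: stack two DLT rows per view and take the right singular vector of the smallest singular value. A minimal sketch (hypothetical function name; conditioning the input points, e.g. Hartley normalisation, often helps when extra views add error):

```python
import numpy as np

def triangulate(Ps, xs):
    # Ps: list of 3x4 projection matrices; xs: list of 2D points (x, y).
    # For each view, P X = alpha * x gives two independent linear
    # equations in X: x * P[2] - P[0] = 0 and y * P[2] - P[1] = 0.
    A = []
    for P, (x, y) in zip(Ps, xs):
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    # Least-squares solution: the right singular vector belonging to the
    # smallest singular value. More views just add more rows to A.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenise

# Two synthetic views of the point (1, 2, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[1.0], [0.0], [0.0]])])
print(triangulate([P1, P2], [(0.2, 0.4), (0.4, 0.4)]))  # recovers (1, 2, 5)
```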

3d pose estimation in opencv

I have an object with some sensors on it with a known 3d location in a fixed orientation in relation to each other. Let's call this object the "detector".
I have a few of these sensors' detected locations in 3D world space.
Problem: how do I get an estimated pose (position and rotation) of the "detector" in 3D world space.
I tried looking into the PnP problem, FLANN and ORB matching, and kNN for the outliers, but it seems like they all expect a camera position of some sort. I have nothing to do with a camera, and all I want is the pose of the "detector". Considering that OpenCV is a "vision" library, do I even need OpenCV for this?
Edit: not all sensors might be detected, here indicated by the light-green dots.
Sorry, this is kinda old, but it's never too late to do object tracking :)
OpenCV's solvePnPRansac should work for you. You don't need to supply an initial pose; just make empty Mats for rvec and tvec to hold your results.
Also, because there is no real camera, just use the identity for the camera matrix and zero distortion coefficients.
The first time you call PnP with empty rvec and tvec, make sure to set useExtrinsicGuess = false. Save your results if they are good, then use them for the next frame with useExtrinsicGuess = true so it can optimize the function more quickly.
https://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html#solvepnpransac
You definitely do not need openCV to estimate the position of your object in space.
This is a simple optimization problem where you need to minimize a distance to the model.
First, you need to create a model of your object's attitude in space.
def detector(position, attitude):  # position = [x, y, z], attitude = [alpha, beta, gamma]
which should return a list or array of the positions of all your points (with their IDs) in 3D space.
You can even create a class for each of these sensor points, and a class for the whole object which has as attributes as many sensors as you have on your object.
Then you need to build an optimization algorithm for fitting your model on the detected data.
The algorithm should use the attitude x, y, z, alpha, beta, gamma as its variables.
For the objective function, you can use something like the sum of distances between corresponding IDs.
Let's say you have a 3-point object that you want to fit to 3 data points:
# Model
m1 = [x1, y1, z1]
m2 = [x2, y2, z2]
m3 = [x3, y3, z3]

# Data
p1 = [xp1, yp1, zp1]
p2 = [xp2, yp2, zp2]
p3 = [xp3, yp3, zp3]

import numpy as np

def distanceL2(pt1, pt2):
    # Euclidean distance between two 3D points
    return np.sqrt((pt1[0]-pt2[0])**2 + (pt1[1]-pt2[1])**2 + (pt1[2]-pt2[2])**2)

# You already know you want to relate the "1"s, "2"s and "3"s
obj_function = distanceL2(m1, p1) + distanceL2(m2, p2) + distanceL2(m3, p3)
Now you need to dig into optimization libraries to find the best algorithm to use, depending on how fast you need your optimization to be.
Since your points in space are virtually connected, this should not be too difficult. scipy.optimize could do it.
To reduce the dimensionality of your problem, try taking one of the 'detected' points as a reference (as if that measurement could be trusted) and then find the minimum of the obj_function for that position (only 3 parameters, corresponding to the orientation, are left to optimize), and then iterate for each of the points you have.
Once you have the optimum, you can look for a better position for this sensor around it and see whether the distance decreases again.
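The fitting scheme described above can be sketched with scipy.optimize, as the answer suggests. This is a minimal sketch, not the answerer's code: the function names, the 3-sensor layout, and the Z-Y-X Euler convention are all assumptions, and squared distances are used instead of plain distances to keep the objective smooth:

```python
import numpy as np
from scipy.optimize import minimize

def rotation(alpha, beta, gamma):
    # Z-Y-X Euler angles (the convention here is an arbitrary choice).
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rz = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rx = np.array([[1, 0, 0], [0, cg, -sg], [0, sg, cg]])
    return Rz @ Ry @ Rx

def detector(pose, model):
    # Model of the object's attitude: rotate the sensor layout, then translate.
    x, y, z, alpha, beta, gamma = pose
    return model @ rotation(alpha, beta, gamma).T + np.array([x, y, z])

def obj_function(pose, model, data):
    # Sum of squared distances between corresponding (same-ID) points.
    return np.sum((detector(pose, model) - data) ** 2)

# Hypothetical 3-sensor layout in the detector's own frame.
model = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
# Fake "detected" data: the layout translated by (2, 3, 4) with no rotation.
data = model + np.array([2.0, 3.0, 4.0])

result = minimize(obj_function, x0=np.zeros(6), args=(model, data))
print(result.x[:3])  # translation estimate, close to (2, 3, 4)
```

Handling the missing-sensor case would mean summing only over the IDs that were actually detected.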

3D reconstruction from two calibrated cameras - where is the error in this pipeline?

There are many posts about 3D reconstruction from stereo views of known internal calibration, some of which are excellent. I have read a lot of them, and based on what I have read I am trying to compute my own 3D scene reconstruction with the below pipeline / algorithm. I'll set out the method then ask specific questions at the bottom.
0. Calibrate your cameras:
This means retrieving the camera calibration matrices K1 and K2 for Camera 1 and Camera 2. These are 3x3 matrices encapsulating each camera's internal parameters: focal length and principal point offset / image centre. They don't change, so you should only need to do this once per camera, as long as you don't zoom or change the resolution you record in.
Do this offline. Do not argue.
I'm using OpenCV's CalibrateCamera() and checkerboard routines, but this functionality is also included in the Matlab Camera Calibration toolbox. The OpenCV routines seem to work nicely.
1. Fundamental Matrix F:
With your cameras now set up as a stereo rig, determine the fundamental matrix (3x3) of that configuration using point correspondences between the two images/views.
How you obtain the correspondences is up to you and will depend a lot on the scene itself.
I am using OpenCV's findFundamentalMat() to get F, which provides a number of options method wise (8-point algorithm, RANSAC, LMEDS).
You can test the resulting matrix by plugging it into the defining equation of the Fundamental matrix: x'Fx = 0, where x' and x are the raw image point correspondences (x, y) in homogeneous coordinates (x, y, 1), and one of the three-vectors is transposed so that the multiplication makes sense. The nearer to zero for each correspondence, the better F is obeying its relation. This is equivalent to checking how well the derived F actually maps from one image plane to another. I get an average deflection of ~2px using the 8-point algorithm.
2. Essential Matrix E:
Compute the Essential matrix directly from F and the calibration matrices.
E = K2.t * F * K1
3. Internal Constraint upon E:
E should obey certain constraints. In particular, if decomposed by SVD into U S V.t, then its singular values should be a, a, 0: the first two diagonal elements of S should be equal and the third zero.
I was surprised to read here that if this is not true when you test for it, you might choose to fabricate a new Essential matrix from the prior decomposition like so: E_new = U * diag(1,1,0) * V.t, which is of course guaranteed to obey the constraint. You have essentially set S = diag(1, 1, 0) artificially.
4. Full Camera Projection Matrices:
There are two camera projection matrices P1 and P2. These are 3x4 and obey the x = PX relation. Also, P = K[R|t] and therefore K_inv.P = [R|t] (where the camera calibration has been removed).
The first matrix P1 (excluding the calibration matrix K) can be set to [I|0]; then P2 (excluding K) is [R|t].
Compute the Rotation and translation between the two cameras R, t from the decomposition of E. There are two possible ways to calculate R (U*W*V.t and U*W.t*V.t) and two ways to calculate t (±third column of U), which means that there are four combinations of Rt, only one of which is valid.
Compute all four combinations, and choose the one that geometrically corresponds to the situation where a reconstructed point is in front of both cameras. I actually do this by carrying through and calculating the resulting P2 = [R|t] and triangulating the 3D positions of a few correspondences in normalised coordinates to ensure that they have a positive depth (z-coordinate).
5. Triangulate in 3D
Finally, combine the recovered 3x4 projection matrices with their respective calibration matrices: P'1 = K1P1 and P'2 = K2P2
And triangulate the 3-space coordinates of each 2D point correspondence accordingly, for which I am using the LinearLS method from here.
QUESTIONS:
Are there any howling omissions and/or errors in this method?
My F matrix is apparently accurate (0.22% deflection in the mapping compared to typical coordinate values), but when testing E against x'Ex = 0 using normalised image correspondences the typical error in that mapping is >100% of the normalised coordinates themselves. Is testing E against x'Ex = 0 valid, and if so, where is that jump in error coming from?
The error in my fundamental matrix estimation is significantly worse when using RANSAC than the 8pt algorithm, ±50px in the mapping between x and x'. This deeply concerns me.
'Enforcing the internal constraint' still sits very weirdly with me - how can it be valid to just manufacture a new Essential matrix from part of the decomposition of the original?
Is there a more efficient way of determining which combo of R and t to use than calculating P and triangulating some of the normalised coordinates?
My final re-projection error is hundreds of pixels in 720p images. Am I likely looking at problems in the calibration, determination of P-matrices or the triangulation?
The error in my fundamental matrix estimation is significantly worse when using RANSAC than the 8pt algorithm, ±50px in the mapping between x and x'. This deeply concerns me.
Using the 8pt algorithm does not exclude using the RANSAC principle.
When using the 8pt algorithm directly which points do you use? You have to choose 8 (good) points by yourself.
In theory you can compute a fundamental matrix from any point correspondences, and you often get a degenerate fundamental matrix because the linear equations are not independent. Another point is that the 8pt algorithm uses an overdetermined system of linear equations, so a single outlier will destroy the fundamental matrix.
Have you tried to use the RANSAC result? I bet it represents one of the correct solutions for F.
My F matrix is apparently accurate (0.22% deflection in the mapping compared to typical coordinate values), but when testing E against x'Ex = 0 using normalised image correspondences the typical error in that mapping is >100% of the normalised coordinates themselves. Is testing E against x'Ex = 0 valid, and if so where is that jump in error coming from?
Again, if F is degenerate, x'Fx = 0 can hold for every point correspondence.
Another reason for your incorrect E may be a swap of the cameras (K1.t * F * K2 instead of K2.t * F * K1). Remember to check: x'Ex = 0.
'Enforcing the internal constraint' still sits very weirdly with me - how can it be valid to just manufacture a new Essential matrix from part of the decomposition of the original?
It is explained in 'Multiple View Geometry in Computer Vision' by Hartley and Zisserman. As far as I know, it amounts to finding the closest valid essential matrix in the Frobenius norm.
You can Google it and there are pdf resources.
Is there a more efficient way of determining which combo of R and t to use than calculating P and triangulating some of the normalised coordinates?
Not as far as I know.
My final re-projection error is hundreds of pixels in 720p images. Am I likely looking at problems in the calibration, determination of P-matrices or the triangulation?
Your rigid body transformation P2 is incorrect because E is incorrect.
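The two mechanical steps discussed in this thread - enforcing the internal constraint on E, and enumerating the four (R, t) candidates - can be sketched as follows (a minimal sketch with hypothetical function names; the W-matrix convention follows Hartley & Zisserman):

```python
import numpy as np

def enforce_internal_constraint(E):
    # Replace the singular values of E with (1, 1, 0): the closest valid
    # essential matrix in the Frobenius norm (up to overall scale).
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

def decompose_essential(E):
    # The four (R, t) candidates; only one of them reconstructs points
    # in front of both cameras (the cheirality test).
    U, _, Vt = np.linalg.svd(E)
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    # Force proper rotations (det = +1).
    R1 *= np.sign(np.linalg.det(R1))
    R2 *= np.sign(np.linalg.det(R2))
    t = U[:, 2]  # translation direction, up to sign and scale
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```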

Nature of relationship between optic flow and depth

Assuming a static scene, with a single camera moving exactly sideways by a small distance, there are two frames and the following computed optic flow (I use OpenCV's calcOpticalFlowFarneback):
Here the scatter points are detected features, painted in pseudocolor with depth values (red is little depth, close to the camera; blue is more distant). Now, I obtain those depth values by simply inverting the optic flow magnitude, like d = 1 / flow. Seems kind of intuitive, in a motion-parallax way - the brighter the object, the closer it is to the observer. So there's a cube, exposing a frontal edge and a bit of a side edge to the camera.
But then I try to project those feature points from the camera plane to real-world coordinates to make a kind of top-view map, where X = (x * d) / f and Y = d (d is depth, x is the pixel coordinate, f is the focal length, and X and Y are real-world coordinates). And here's what I get:
Well, it doesn't look cubic to me. The picture looks skewed to the right. I've spent some time thinking about why, and it seems that 1 / flow is not an accurate depth metric. Playing with different values - say, if I use 1 / power(flow, 1 / 3) - I get a better picture:
But, of course, the power of 1 / 3 is just a magic number out of my head. The question is, what is the relationship between optic flow and depth in general, and how am I supposed to estimate it for a given scene? We're just considering camera translation here. I've stumbled upon some papers, but no luck trying to find a general equation yet. Some, like that one, propose a variation of 1 / flow, which isn't going to work, I guess.
Update
What bothers me a little is that simple geometry points me to the 1 / flow answer too. Optic flow is the same (in my case) as disparity, right? Then using this formula I get d = Bf / (x2 - x1), where B is the distance between the two camera positions, f is the focal length, and x2 - x1 is precisely the optic flow. The focal length is a constant, and B is constant for any two given frames, so that leaves me with 1 / flow again, multiplied by a constant. Do I misunderstand something about what optic flow is?
For a static scene, moving a camera precisely sideways by a known amount is exactly the same as a stereo camera setup. From this, you can indeed estimate depth, if your system is calibrated.
Note that calibration in this sense is rather broad. In order to get accurate real-world depth, you will in the end need to supply a scale parameter on top of the regular calibration stuff you have in OpenCV, or else there is a single uniform scale ambiguity in the 3D reconstruction (this last step is often called going to the "metric" reconstruction from the merely "Euclidean" one).
Another thing which is part of this broad calibration is lens distortion compensation. Before anything else, you probably want to force your cameras to behave like pinhole cameras (which real-world cameras usually don't).
With that said, optical flow is definitely very different from a metric depth map. Even if you properly calibrate and rectify your system first, optical flow is still not equivalent to disparity estimation. If your system is rectified, there is no point in doing a full optical flow estimation (such as Farnebäck), because the problem is thereafter constrained along the horizontal lines of the image. Doing a full optical flow estimation (giving 2 d.o.f.) will likely introduce more error after said rectification.
A great reference for all this stuff is the classic "Multiple View Geometry in Computer Vision"
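Under the asker's own derivation (pure sideways translation, so horizontal flow equals stereo disparity), depth really is d = B * f / flow, i.e. 1 / flow times a constant. A minimal numeric sketch, where the baseline and focal length are hypothetical values; without them, depth is recovered only up to a single global scale:

```python
import numpy as np

B = 0.1                                # baseline: metres moved between frames
f = 700.0                              # focal length in pixels
flow_x = np.array([14.0, 7.0, 3.5])    # horizontal flow per feature (pixels)

# d = B * f / disparity; here B * f = 70, so depths are 5, 10 and 20 metres.
depth = B * f / flow_x
print(depth)
```

The skew the asker sees in the top view is therefore more likely a back-projection issue (e.g. x not measured from the principal point) than a wrong depth formula.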

Fast patch extraction using homography

Suppose you have a homography H relating a planar surface depicted in two different images, Ir and I. Ir is the reference image, where the planar surface is parallel to the image plane (and practically occupies the entire image). I is a run-time image (a photo of the planar surface taken from an arbitrary viewpoint). Let H be such that:
p = Hp', where p is a point in Ir and p' is the corresponding point in I.
Suppose you have two points p1 = (x1, y) and p2 = (x2, y), with x1 < x2, relative to the image Ir. Note that they belong to the same row (common y). Let H' = H^(-1). Using H', you can compute the points in I corresponding to (x1,y), (x1+1,y), ..., (x2,y).
The question is: is there a way to avoid the matrix-vector multiplication to compute all those points? The easiest way that comes to mind is to use the homography to compute the points corresponding to p1 and p2 (call them p1' and p2'). To obtain the others (that is, (x1+1,y), (x1+2,y), ..., (x2-1,y)), linearly interpolate between p1' and p2' in the image I.
But since there is a projective transformation between Ir and I, I think this method is quite imprecise.
Any other ideas? This question stems from the fact that I need a computationally efficient way to extract a lot of small patches (around 10x10 pixels) around a point p in Ir in real-time software.
Thank you.
P.S. Maybe the fact that I am using small patches would make linear interpolation a suitable approach?
You have a projective transform, and, unfortunately, ratios of lengths are not invariant under this type of transformation.
My suggestion: explore the cross-ratio, because it is invariant under projective transformations. I think that from each 3 points you can get a "cheaper" 4th, avoiding the matrix-vector computation and using the cross-ratio instead. But I have not put it all down on paper to verify whether the cross-ratio alternative is that much computationally cheaper than the matrix-vector multiplication.
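Another exact option, sketched below under the observation that consecutive row points differ by (1, 0, 0) in homogeneous coordinates: H'·(x+1, y, 1) equals H'·(x, y, 1) plus the first column of H', so after one full matrix-vector product per row, each further point costs only a 3-vector addition and one perspective division (no interpolation error). `map_row` is a hypothetical helper name:

```python
import numpy as np

def map_row(Hinv, x1, x2, y):
    # Map (x1, y), (x1+1, y), ..., (x2, y) through the homography Hinv.
    # Since (x+1, y, 1) - (x, y, 1) = (1, 0, 0), the homogeneous result
    # advances by the first column of Hinv at each step.
    h = Hinv @ np.array([x1, y, 1.0])   # full product for the first point only
    col0 = Hinv[:, 0]                   # constant per-step increment
    out = []
    for _ in range(x1, x2 + 1):
        out.append(h[:2] / h[2])        # perspective division per point
        h = h + col0
    return np.array(out)
```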
