Measure distance between camera and flat objects - image-processing

I have an application that is detecting objects in dashcam videos using TensorFlow object detection. I am trying to calculate the physical distance (and subsequently the angle) of the detected objects from the camera.
I tried the similar triangles method described in this post:
https://medium.com/geoai/road-feature-detection-geotagging-600ea03f9a8
It is working for objects with a height, such as road signs, but how do I calculate the distance of flat objects such as potholes? I tried setting a height of 1 mm, but the result is not correct.

When taking an image of a 3D scene we lose depth information in the process. In some cases, we can infer the lost information using various methods such as triangulation, or by using assumptions about the scene like the one you are making (knowing the height of the object whose distance you are trying to calculate).
When inferring the distance of an object that has no height, you will need to use some other piece of information. For example, you can use the width/diameter of the pothole (if you know it) as a replacement for the height, substituting it for h and H in your calculations accordingly.
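As a rough sketch of that substitution (assuming the focal length is already expressed in pixels and a typical pothole diameter is known; all numbers below are made up for illustration):

```python
# Similar-triangles distance estimate using a known real-world width instead of height.
# focal_length_px: focal length converted to pixels (focal_mm * image_width_px / sensor_width_mm)
# real_width_m:    assumed real-world diameter of the pothole
# width_in_image_px: horizontal extent of the detection box in pixels
def distance_from_width(focal_length_px, real_width_m, width_in_image_px):
    return focal_length_px * real_width_m / width_in_image_px

# Example: 1000 px focal length, 0.5 m pothole, 40 px wide box -> about 12.5 m away
print(distance_from_width(1000.0, 0.5, 40.0))
```

The reason the 1 mm "height" trick fails is that a flat object seen at a grazing angle has its vertical extent in the image heavily foreshortened, so it no longer relates to any fixed physical height; the horizontal width is barely foreshortened, which is why it gives a usable estimate.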

Related

Is it possible to find camera position using 8-10 non-coplanar points, if their 3D coordinates are unknown?

I have a set of non-coplanar points with unknown 3D positions (I am not limited in the number of points; let's say 8-10 of them), and at least 3 different views (the number of views is also not limited) of these points in 2D images. I also have an estimate of the rotation and scale for every point set in the pictures that corresponds to the real points, as well as an estimate of the Euclidean distance between every two camera positions from which the images were taken.
Is this data enough to find the camera pose after taking an additional picture with these points (as precisely as possible)? If not, what is the minimal additional data needed to achieve this?
UPDATE: In this specific case I needed the recoverPose() function from the calib3d module.
Yes, this is possible. Depending on the algorithms (and the availability of some pre-calibration), you can obtain the relative positions of two cameras using a minimum of 5 to 8 points.
Beware that the point correspondences must be available, i.e. the points must be known in pairs.
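As a sketch of what that looks like in practice with the recoverPose() route mentioned in the update (assuming a pre-calibrated camera with 3x3 intrinsic matrix K and Nx2 arrays pts1/pts2 of matched pixel coordinates between two views; the names are placeholders):

```python
import cv2
import numpy as np

def relative_pose(pts1, pts2, K):
    pts1, pts2 = np.float64(pts1), np.float64(pts2)
    # Estimate the essential matrix from the point correspondences.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K,
                                      method=cv2.RANSAC, prob=0.999, threshold=1.0)
    # recoverPose picks, out of the four possible (R, t) decompositions of E,
    # the one that places the points in front of both cameras.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t  # t is only known up to scale
```

Note that the recovered translation is only defined up to scale; the known Euclidean distances between camera positions mentioned in the question are exactly what you would use to fix the metric scale.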

Can you get your current speed using ARKit?

I have a few related questions.
1) Can you get the position of the device in global coordinates? I tried to get this value using ARFrame.camera.transform.columns.3, but it seems like the [X, Y, Z] in this column is always [0, 0, 0]. I interpreted this transform to be the camera's orientation with respect to the body frame. Can someone explain what exactly you get out of ARFrame.camera.transform?
2) If we have the position of the device (camera) in the global coordinates, I assume we can easily get the velocity of the device. Is this a valid statement?
3) Can you only get the global position when you are tracking an object? Thus, you get your position relative to the tracked object? I would like to get the speed of the device even when the camera shakes a lot, thus the tracking quality is not always good.
Yes, you can make a speedometer with ARKit. A few people have already.
Regarding your more specific questions...
ARKit doesn’t have “global coordinates”, or probably not in the sense you’re thinking. Camera and anchor transforms use a shared reference frame (“world” space in traditional 3D graphics parlance), but that reference frame is valid only within the session: 0,0,0 is where your camera/device was at the beginning of the session.
If you have two positions at two different times in any shared reference frame, the difference between those positions is the average velocity over that time.
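As a plain illustration of that statement (not ARKit API code, just the arithmetic, with made-up positions and timestamps; in ARKit the positions would come from the camera transform of two frames and their capture times):

```python
import numpy as np

# Average velocity = displacement / elapsed time.
def average_velocity(pos_a, time_a, pos_b, time_b):
    displacement = np.asarray(pos_b) - np.asarray(pos_a)
    return displacement / (time_b - time_a)  # metres per second if positions are in metres

# Device moved 0.4 m forward and 0.1 m up over half a second -> speed in m/s
speed = np.linalg.norm(average_velocity([0, 0, 0], 0.0, [0.4, 0.0, 0.1], 0.5))
print(speed)
```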
ARKit doesn’t track objects. The camera transform is always relative to “world” space. As mentioned above, it’s 0,0,0 at the beginning of your session because the reference frame is based within the session.
If you want Global positioning — that is, relative to the Earth — you should be looking at Core Location. Note that there’s a difference of scale and precision, though: GPS is accurate to a meter or two but operates at planet scale, and ARKit is accurate to a centimeter or two but operates at room scale.

Extrinsic Camera Calibration Using OpenCV's solvePnP Function

I'm currently working on an augmented reality application using a medical imaging program called 3DSlicer. My application runs as a module within the Slicer environment and is meant to provide the tools necessary to use an external tracking system to augment a camera feed displayed within Slicer.
Currently, everything is configured properly so that all that I have left to do is automate the calculation of the camera's extrinsic matrix, which I decided to do using OpenCV's solvePnP() function. Unfortunately this has been giving me some difficulty as I am not acquiring the correct results.
My tracking system is configured as follows:
The optical tracker is mounted in such a way that the entire scene can be viewed.
Tracked markers are rigidly attached to a pointer tool, the camera, and a model that we have acquired a virtual representation for.
The pointer tool's tip was registered using a pivot calibration. This means that any values recorded using the pointer indicate the position of the pointer's tip.
Both the model and the pointer have 3D virtual representations that augment a live video feed as seen below.
The pointer and camera (Referred to as C from hereon) markers each return a homogeneous transform that describes their position relative to the marker attached to the model (Referred to as M from hereon). The model's marker, being the origin, does not return any transformation.
I obtained two sets of points, one 2D and one 3D. The 2D points are the coordinates of a chessboard's corners in pixel coordinates, while the 3D points are the corresponding world coordinates of those same corners relative to M. These were recorded using OpenCV's findChessboardCorners() function for the 2D points and the pointer for the 3D ones. I then transformed the 3D points from M space to C space by multiplying them by C inverse. This was done because the solvePnP() function requires that the 3D points be described relative to the world coordinate system of the camera, which in this case is C, not M.
Once all of this was done, I passed the point sets into solvePnP(). The transformation I got was completely incorrect, though. I am honestly at a loss as to what I did wrong. Adding to my confusion is the fact that OpenCV uses a different coordinate format from OpenGL, which is what 3DSlicer is based on. If anyone can provide some assistance in this matter I would be exceptionally grateful.
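For reference, the general shape of that call (with placeholder names, not the project's actual code; camera_matrix and dist_coeffs would come from a prior intrinsic calibration):

```python
import cv2
import numpy as np

# object_points: Nx3 chessboard corners in the model frame
# image_points:  Nx2 matching pixel coordinates
def estimate_extrinsics(object_points, image_points, camera_matrix, dist_coeffs):
    ok, rvec, tvec = cv2.solvePnP(np.float32(object_points),
                                  np.float32(image_points),
                                  camera_matrix, dist_coeffs)
    R, _ = cv2.Rodrigues(rvec)       # 3x3 rotation, object frame -> camera frame
    return np.hstack([R, tvec])      # 3x4 [R|t] extrinsic matrix
```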
Also if anything is unclear, please don't hesitate to ask. This is a pretty big project so it was hard for me to distill everything to just the issue at hand. I'm wholly expecting that things might get a little confusing for anyone reading this.
Thank you!
UPDATE #1: It turns out I'm a giant idiot. I recorded only collinear points because I was too impatient to record the entire checkerboard. Of course this meant that there were nearly infinitely many solutions to the least-squares regression, as I had only constrained the solution in 2 dimensions! My values are much closer to my ground truth now, and in fact the rotation columns seem correct, except that they're completely out of order. I'm not sure what could cause that, but it seems that my rotation matrix was mirrored across the center column. In addition, my translation components are negative when they should be positive, although their magnitudes seem to be correct. So now I've basically got all the right values in all the wrong order.
Mirror/rotational ambiguity.
You basically need to reorient your coordinate frames by imposing the constraints that (1) the scene is in front of the camera and (2) the checkerboard axes are oriented as you expect them to be. This boils down to multiplying your calibrated transform by an appropriate ("hand-built") rotation and/or mirroring.
The basic problem is that the calibration target you are using, even when all the corners are seen, has at least a 180-degree rotational ambiguity unless color information is used. If some corners are missed, things can get even weirder.
You can often use prior information about the camera orientation w.r.t. the scene to resolve this kind of ambiguity, as suggested above. However, in more dynamic situations, or if a further degree of automation is needed when the target may be only partially visible, you'd be much better off using a target in which each small chunk of corners can be individually identified. My favorite is Matsunaga and Kanatani's "2D barcode" one, which uses sequences of square lengths with unique cross-ratios. See the paper here.
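Purely as an illustration of the "hand-built" correction idea (the right permutation/mirror depends entirely on your own axis conventions, so treat this as a sketch rather than a recipe):

```python
import numpy as np

# Post-multiply the recovered rotation by a fixed axis permutation/mirror and
# adjust the translation sign until the two constraints hold: the scene is in
# front of the camera and the board axes point the expected way.
def apply_correction(R, t):
    # Example correction matching the symptoms in the update: mirror the
    # rotation across its centre column and flip the translation sign.
    C = np.array([[0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0]])
    return R @ C, -t
```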

How to determine distance of objects from camera using Epipolar Plane Image?

I am working on converting 2D images into a 3D environment. The images were collected from a video captured with lateral (sideways) camera motion. The images were then stacked one behind the other, so it is easy to find the correspondences between consecutive images. This stack is called a spatio-temporal volume.
Next I take a slice from the spatiotemporal volume. That slice is called the Epipolar Plane Image.
Using the Epipolar Plane Image, I want to calculate the depth of the objects in the scene and build a 3D environment. I have listed the reference below, but I have not been able to figure out the math described in the paper. Can someone help me figure this out? Any help is appreciated.
Reference
Bolles, Baker, and Marimont, "Epipolar-Plane Image Analysis: An Approach to Determining Structure from Motion", International Journal of Computer Vision, 1987.
The math in this situation is straightforward.
First, let's define the coordinate systems for two overlapping images taken by the same camera with focal length f; this is the classic "normal case" of stereo photogrammetry, which matches your lateral camera motion.
Let us say that the first camera position is the origin:

X01 = Y01 = Z01 = 0

while its orientation, expressed with three Euler angles, is:

ω1 = φ1 = κ1 = 0

With this definition the corresponding rotation matrix is the identity matrix.
The second camera position is shifted along the direction of motion by the baseline B:

X02 = B,  Y02 = Z02 = 0

and since the orientation is the same as for the first camera, all Euler angles remain zero:

ω2 = φ2 = κ2 = 0

which also means that the corresponding rotation matrix is again the identity matrix.
If the images overlap and the orientation is the same, a scene point projects to (x', y') in the first image and to (x'', y'') in the second image (each measured with some finite accuracy); because the motion is purely lateral, y' ≈ y'' and only the x coordinate shifts.
This geometrical situation can be described by using the Intercept Theorem:

X = B · x' / (x' − x'')
Y = B · y' / (x' − x'')
Z = B · f / (x' − x'')

where the parallax x' − x'' plays the role of the disparity: the larger it is, the closer the point is to the camera.
As you can see, it's not complicated. But be aware that this solution is certainly not the best, since its base assumption, that all orientation angles are the same, can't be fulfilled in reality.
If you need to be accurate, then you have to perform a bundle adjustment. However, these equations are often used to determine an approximate solution for this geometric situation, whose values are then used to linearize the collinearity equations.
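To make the depth relation concrete, here is a minimal sketch of the normal-case formula above (all numbers are made up for illustration):

```python
# Normal-case (intercept theorem) depth: Z = f * B / (x' - x'')
# f_px:        focal length in pixels
# baseline_m:  camera displacement between the two frames, in metres
# parallax_px: horizontal shift x' - x'' of the same point between the frames
def normal_case_depth(f_px, baseline_m, parallax_px):
    return f_px * baseline_m / parallax_px

# e.g. f = 1200 px, B = 0.2 m between frames, 15 px parallax -> 16 m
print(normal_case_depth(1200.0, 0.2, 15.0))
```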

How to get the real life size of an object from an image, when not knowing the distance between object and the camera?

I have to make a mobile app that calculates the real life size of an object in an image.
I have done some research on it and found a helpful question: How would you find the height of objects given an image?
The relation between the distance to the camera and the real-life size of the object isn't actually that complex: the ratio of the size of the object on the sensor to the size of the object in real life is the same as the ratio between the focal length and the distance to the object.
distance to object (mm) = (focal length (mm) × real height of the object (mm) × image height (pixels)) / (object height (pixels) × sensor height (mm))
But how do I get the real height of the object if the distance is not known?
Do the tools that create 3D models from images produce real-life dimensions?
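For reference, the relation above and its rearrangement written as plain arithmetic (parameter names are just placeholders):

```python
# The quoted relation, solved for distance.
def estimate_distance_mm(focal_mm, real_height_mm, image_height_px,
                         object_height_px, sensor_height_mm):
    return focal_mm * real_height_mm * image_height_px / (object_height_px * sensor_height_mm)

# Rearranged for real height: the distance still has to come from somewhere,
# which is exactly the problem the question asks about.
def estimate_real_height_mm(focal_mm, distance_mm, image_height_px,
                            object_height_px, sensor_height_mm):
    return distance_mm * object_height_px * sensor_height_mm / (focal_mm * image_height_px)
```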
The simple answer is you can't.
Incidentally, this is why humans have two eyes. If you want to judge size without a known distance, you'll need at least two reference points. This allows you to triangulate the position of the object, get a distance to it, and use your known focal distance to calculate the size.
The more complex answer is that there are ways around this, for example:
Cheat by using a known reference:
For example, if you have an object of known size, you can infer the distance. This is similar to what NASA does to calibrate its cameras, for example.
You can make safe assumptions if you're dealing with common objects, such as the height of one storey when analysing the image of a building.
Move your camera around:
This allows you to get more than one reference point with the same camera.
I suppose you could use the accelerometer to accurately measure the positional relation between the image captured at point T1 in time and point T2. This would give you two images of the same subject with a known distance between them. This then allows you to triangulate as if you had two eyes.
Whether normal hand-held camera jitters will be sufficient for triangulation, or whether the accelerometer will be accurate enough to inertially position the phone, I don't know.
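As a hedged sketch of that two-view idea (assuming the camera intrinsics K and the relative motion R, t between the two shots are known; all inputs are placeholders):

```python
import cv2
import numpy as np

# pts1, pts2: Nx2 arrays of matched pixel coordinates of the same object points
# in the two shots. The result has the same units as the baseline t, so a
# metrically known motion yields metric sizes.
def triangulate(K, R, t, pts1, pts2):
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # first camera at the origin
    P2 = K @ np.hstack([R, t.reshape(3, 1)])            # second camera after the known motion
    X_h = cv2.triangulatePoints(P1, P2,
                                np.float64(pts1).T,      # OpenCV expects 2xN point arrays
                                np.float64(pts2).T)
    return (X_h[:3] / X_h[3]).T                          # Nx3 metric points
```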
Assume a distance:
If your app is designed to compare something on the scale of a human hand (or other bit of human anatomy), you can probably safely assume a distance based on what people will naturally do. The focus limits of the camera itself will also give an upper and lower range on how far an object can be and still be in focus. This will probably be within a tolerable margin of error.
As you mention in your question, there is an entire subfield dedicated to this question, and it is an active research area.
