How to use Normalized MPJPE evaluation metric in 3D pose estimation - machine-learning

In 3D pose estimation, MPJPE and P-MPJPE are the most popular measures used to evaluate the correctness of a predicted pose. MPJPE has a known issue: keypoints far from the pelvis root, such as the ankles, tend to accumulate larger errors. So I am trying to compute Normalized MPJPE (N-MPJPE) scores for the same poses, but I am getting unexpectedly high values. Here is the code that I have used and the output I got. Please let me know where the issue is.
Ground truth and predicted data shape: [num_poses, num_keypoints, 3]
Output
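For reference, N-MPJPE is usually defined (e.g. for Human3.6M-style evaluation) by root-centring both poses and rescaling each predicted pose by the single scale factor that best aligns it to the ground truth before measuring MPJPE. Below is a minimal NumPy sketch of that convention; the root joint index 0 (pelvis) and millimetre units are assumptions, not something fixed by the metric itself:

    import numpy as np

    def mpjpe(pred, gt):
        # Mean per-joint position error: mean Euclidean distance over joints and poses.
        return np.mean(np.linalg.norm(pred - gt, axis=-1))

    def n_mpjpe(pred, gt, root_idx=0):
        # Normalized MPJPE. Shapes: [num_poses, num_keypoints, 3].
        # root_idx=0 (pelvis) is an assumption; adapt it to your skeleton.
        pred = pred - pred[:, root_idx:root_idx + 1, :]   # root-centre both poses
        gt = gt - gt[:, root_idx:root_idx + 1, :]
        # Optimal per-pose scale s minimising ||s * pred - gt||^2 over all joints:
        num = np.sum(gt * pred, axis=(1, 2), keepdims=True)
        den = np.sum(pred * pred, axis=(1, 2), keepdims=True)
        scale = num / np.maximum(den, 1e-8)
        return mpjpe(scale * pred, gt)

One thing worth checking against your own code: if the per-pose scale is computed before root-centring, or on only one of the two tensors, N-MPJPE can come out unexpectedly large.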

Related

How to calculate the confidence score of a keypoint estimation from a heatmap

I have tried to build the Convolutional Pose Machines model from this paper (https://arxiv.org/pdf/1602.00134.pdf).
The model works fine and outputs 15 heatmaps (one per keypoint + 1 for background). From these heatmaps I can calculate the keypoint positions (simply the location of the maximum value in each heatmap).
My question is: Is this maximum value in the heatmap also equal to the confidence score of the model that the keypoint is in the image?
Maybe this is a dumb question but in the paper the authors don't mention how they calculate the confidence score or how they handle non-visible keypoints.
The best way to answer, I believe, is to dig into the actual code of popular convolutional pose estimation models to see how this is done in practice.
The Google TensorFlow PoseNet model should be a good example.
What they do in their (open source) code, here (check out the predict method), is to apply a 2D sigmoid activation function to the heatmaps, for each keypoint of the pose.
So, to answer your question, I would say that the maximum value in the heatmap is not directly equal to the confidence score - the confidence is the output of the sigmoid function (a proper score from 0 to 1).
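As an illustration of that idea (not PoseNet's actual code), here is a minimal NumPy sketch that takes the peak of each heatmap and squashes it with a sigmoid to get a 0-1 confidence. It assumes the heatmaps are raw, pre-activation network outputs, and the visibility threshold is an arbitrary assumption:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def keypoints_with_confidence(heatmaps, threshold=0.3):
        # heatmaps: [num_keypoints, H, W] raw (pre-activation) network output.
        # Returns (x, y, confidence, visible) per keypoint.
        results = []
        for hm in heatmaps:
            y, x = np.unravel_index(np.argmax(hm), hm.shape)
            conf = sigmoid(hm[y, x])      # squash the raw peak into [0, 1]
            visible = conf >= threshold   # simple visibility decision (assumed threshold)
            results.append((x, y, float(conf), bool(visible)))
        return results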

Measure to separate positive & negative examples using SIFT keypoint object detection?

I used SIFT keypoint descriptors for detecting objects in an image. For that, I used the best matches and calculated a homography matrix.
Using this homography matrix, I found where the object lies in the test image.
Now, for samples where the object could not be found and which have to be checked manually, what measure could help distinguish between negative and positive samples?
Presently, we separate the samples using the determinant of the homography matrix. Is there a better measure?
You may use the number of (filtered) point correspondences as a measure to help distinguish between negative and positive samples,
because positive samples generally have many more point correspondences than negative samples.
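A hedged sketch of that idea in Python with OpenCV: match SIFT descriptors, fit a homography with RANSAC, and use the inlier count returned in the mask as the positive/negative measure. The ratio-test value and the inlier threshold here are assumptions you would tune on your own data:

    import cv2
    import numpy as np

    def count_inlier_matches(img_obj, img_scene, ratio=0.75, min_inliers=15):
        # Returns (inlier_count, is_positive) for a SIFT + homography detection.
        sift = cv2.SIFT_create()
        kp1, des1 = sift.detectAndCompute(img_obj, None)
        kp2, des2 = sift.detectAndCompute(img_scene, None)
        if des1 is None or des2 is None:
            return 0, False
        matcher = cv2.BFMatcher(cv2.NORM_L2)
        knn = matcher.knnMatch(des1, des2, k=2)
        # Lowe ratio test to keep only distinctive matches.
        good = [p[0] for p in knn if len(p) == 2 and p[0].distance < ratio * p[1].distance]
        if len(good) < 4:                       # a homography needs at least 4 points
            return len(good), False
        src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        inliers = int(mask.sum()) if mask is not None else 0
        return inliers, inliers >= min_inliers  # threshold is an assumption to tune

Counting RANSAC inliers rather than raw matches is usually more robust, since a negative image can still produce many accidental descriptor matches that do not agree on a single homography.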

estimateRigidTransform on dense optical flow in OpenCV

I am trying to implement the algorithm described by Wang and Adelson (1993) in the paper Layered Representation for Motion Analysis; a summary of the paper can be found in many computer vision lectures, for example this one from the CS department at the University of North Carolina at Chapel Hill, starting at slide 53:
Compute local flow in a coarse-to-fine fashion
Obtain a set of initial affine motion hypotheses
Divide the image into blocks and estimate affine motion parameters in each block by least squares
Eliminate hypotheses with high residual error
Perform k-means clustering on affine motion parameters
Merge clusters that are close and retain the largest clusters to obtain a smaller set of hypotheses to describe all the motions in the scene
Iterate until convergence:
Assign each pixel to best hypothesis
Pixels with high residual error remain unassigned
Perform region filtering to enforce spatial constraints
Re-estimate affine motions in each region
Since I am using OpenCV to implement the algorithm, it only makes sense to use the built-in functions to do so, and the most relevant function is estimateRigidTransform. From the documentation:
Computes an optimal affine transformation between two 2D point sets.
and the output looks something like this:
The description of the affine model in the slides looks like:
This model is of course consistent with the description in the paper (and everywhere else).
If I want to try and map the output of the function to the given model, the only explanation would be that a_1 and a_4 map to b_1 and b_2. Is this intuition correct? And, following from this, does it make sense that a_3 = -a_5?
Knowing that the output of calcOpticalFlowFarneback is a flow matrix following the relationship prev(y, x) ~ next(y + flow(y, x)[1], x + flow(y, x)[0]) (as given in the OpenCV documentation):
What is the form of the input that should be used to get correct results from the estimateRigidTransform function? And how is it possible to calculate the residuals after the estimation?
Finally, am I considering the wrong function to calculate the affine transforms?
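For what it's worth, here is a minimal sketch of one way to wire this up: sample the dense Farneback flow into two explicit point sets (pixel positions and their flow-displaced positions), estimate a 2x3 affine matrix, and compute per-point residuals. It uses estimateAffine2D, the non-deprecated counterpart of estimateRigidTransform(fullAffine=True) in newer OpenCV versions; the sampling step and RANSAC threshold are assumptions:

    import cv2
    import numpy as np

    def affine_from_flow(flow, step=8):
        # flow: HxWx2 array from cv2.calcOpticalFlowFarneback.
        h, w = flow.shape[:2]
        ys, xs = np.mgrid[0:h:step, 0:w:step]
        src = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
        dst = src + flow[ys.ravel(), xs.ravel()]          # flow-displaced positions
        A, inlier_mask = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                              ransacReprojThreshold=1.0)
        if A is None:
            return None, None, None
        # Residuals: distance between the flow-displaced position and the affine prediction.
        src_h = np.hstack([src, np.ones((src.shape[0], 1), np.float32)])  # homogeneous coords
        pred = src_h @ A.T
        residuals = np.linalg.norm(pred - dst, axis=1)
        return A, residuals, inlier_mask

Running this per block (instead of on the whole frame) gives the per-block affine hypotheses and the residual errors needed for the elimination and k-means steps of the algorithm above.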

3D stereo, bad 3D coordinates

I'm using stereo vision to obtain a 3D reconstruction, with the OpenCV library.
I've implemented my code this way:
1) Stereo calibration
2) Undistortion and rectification of the image pair
3) Disparity map - using SGBM
4) 3D coordinates - calculating the depth map using reprojectImageTo3D()
Results:
- Good disparity map and good 3D reconstruction
- Bad 3D coordinate values: the distances don't correspond to reality.
The 3D distances (the distance between camera and object) have a 10 mm error, and it increases with distance. I've used various baselines and I always get this error.
When I compare the extrinsic parameters, the vector T output by "stereoRectify", the baseline matches.
So I don't know where the problem is.
Can someone help me please? Thanks in advance.
Calibration:
http://textuploader.com/ocxl
http://textuploader.com/ocxm
Ten mm error can be reasonable for stereo vision solutions, all depending of course on the sensor sensitivity, resolution, baseline and the distance to the object.
The increasing error with respect to the object's distance is also typical of the problem: the stereo correspondence essentially performs triangulation from the two video sensors to the object, and the larger the distance, the more a small error in the estimated angle from the sensors to the object translates into a large error along the depth axis. A good example is when the angle from the video sensors to the object is almost a right angle (the two viewing rays nearly parallel), which means that any small positive error in estimating it will throw the estimated depth to infinity.
The architecture you selected looks good. You can try increasing the sensor resolution, or dig into the calibration process, which has a lot of room for tuning in the OpenCV library - making sure only images where the chessboard is static are selected, choosing a higher variety of chessboard poses, adding images until the registration error between the two images drops below the maximum error you can allow, etc.
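One additional sanity check for a scale problem like this is to compare the Z channel from reprojectImageTo3D against the textbook z = f*B/d computed directly from the same disparity. A rough sketch, where the variable names are assumptions and the units are whatever you used for the chessboard square size during calibration:

    import cv2
    import numpy as np

    def check_depth_scale(disparity_sgbm, Q, f_px, baseline):
        # disparity_sgbm: raw StereoSGBM output (int16, fixed-point, i.e. 16 * disparity).
        # f_px: focal length in pixels; baseline: in calibration units
        # (if the chessboard square size was given in mm, depths come out in mm).
        disp = disparity_sgbm.astype(np.float32) / 16.0    # undo the SGBM fixed-point scaling
        points3d = cv2.reprojectImageTo3D(disp, Q)         # Z in calibration units
        valid = disp > 0
        z_from_q = points3d[..., 2][valid]
        z_direct = f_px * baseline / disp[valid]           # textbook z = f*B/d
        print("median Z via reprojectImageTo3D:", np.median(z_from_q))
        print("median Z via f*B/d:             ", np.median(z_direct))

If the two disagree, the usual suspects are forgetting the /16 fixed-point scaling of the SGBM disparity or mixing up the units of the chessboard square size passed to the calibration.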

Accuracy in depth estimation - Stereo Vision

I am doing research in stereo vision, and in this question I am interested in the accuracy of depth estimation. It depends on several factors, such as:
Proper stereo calibration (rotation, translation and distortion extraction),
image resolution,
camera and lens quality (less distortion, proper color capture),
matching features between the two images.
Let's say we have non-low-cost cameras and lenses (no cheap webcams, etc.).
My question is, what is the accuracy of depth estimation we can achieve in this field?
Does anyone know of a real stereo vision system that works with some accuracy?
Can we achieve 1 mm depth estimation accuracy?
My question also concerns systems implemented in OpenCV. What accuracy did you manage to achieve?
Q. Does anyone know of a real stereo vision system that works with some accuracy? Can we achieve 1 mm depth estimation accuracy?
Yes, you definitely can achieve 1mm (and much better) depth estimation accuracy with a stereo rig (heck, you can do stereo recon with a pair of microscopes). Stereo-based industrial inspection systems with accuracies in the 0.1 mm range are in routine use, and have been since the early 1990's at least. To be clear, by "stereo-based" I mean a 3D reconstruction system using 2 or more geometrically separated sensors, where the 3D location of a point is inferred by triangulating matched images of the 3D point in the sensors. Such a system may use structured light projectors to help with the image matching, however, unlike a proper "structured light-based 3D reconstruction system", it does not rely on a calibrated geometry for the light projector itself.
However, most (likely, all) such stereo systems designed for high accuracy use either some form of structured lighting, or some prior information about the geometry of the reconstructed shapes (or a combination of both), in order to tightly constrain the matching of points to be triangulated. The reason is that, generally speaking, one can triangulate more accurately than they can match, so matching accuracy is the limiting factor for reconstruction accuracy.
One intuitive way to see why this is the case is to look at the simple form of the stereo reconstruction equation: z = f b / d. Here "f" (focal length) and "b" (baseline) summarize the properties of the rig, and they are estimated by calibration, whereas "d" (disparity) expresses the match of the two images of the same 3D point.
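Making that quantitative with a standard first-order error propagation (a generic derivation, not tied to any particular rig): differentiating the reconstruction equation with respect to the disparity gives

    z = \frac{f\,b}{d}
    \quad\Rightarrow\quad
    \frac{\partial z}{\partial d} = -\frac{f\,b}{d^{2}} = -\frac{z^{2}}{f\,b}
    \quad\Rightarrow\quad
    |\delta z| \approx \frac{z^{2}}{f\,b}\,|\delta d|,

so a fixed matching error in d produces a depth error that grows quadratically with distance and shrinks as the baseline or focal length grows.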
Now, crucially, the calibration parameters are "global" ones, and they are estimated based on many measurements taken over the field of view and depth range of interest. Therefore, assuming the calibration procedure is unbiased and that the system is approximately time-invariant, the errors in each of the measurements are averaged out in the parameter estimates. So it is possible, by taking lots of measurements, and by tightly controlling the rig optics, geometry and environment (including vibrations, temperature and humidity changes, etc), to estimate the calibration parameters very accurately, that is, with unbiased estimated values affected by uncertainty of the order of the sensor's resolution, or better, so that the effect of their residual inaccuracies can be neglected within a known volume of space where the rig operates.
However, disparities are point-wise estimates: one states that point p in left image matches (maybe) point q in right image, and any error in the disparity d = (q - p) appears in z scaled by f b. It's a one-shot thing. Worse, the estimation of disparity is, in all nontrivial cases, affected by the (a-priori unknown) geometry and surface properties of the object being analyzed, and by their interaction with the lighting. These conspire - through whatever matching algorithm one uses - to reduce the practical accuracy of reconstruction one can achieve. Structured lighting helps here because it reduces such matching uncertainty: the basic idea is to project sharp, well-focused edges on the object that can be found and matched (often, with subpixel accuracy) in the images. There is a plethora of structured light methods, so I won't go into any details here. But I note that this is an area where using color and carefully choosing the optics of the projector can help a lot.
So, what you can achieve in practice depends, as usual, on how much money you are willing to spend (better optics, lower-noise sensor, rigid materials and design for the rig's mechanics, controlled lighting), and on how well you understand and can constrain your particular reconstruction problem.
I would add that using color is a bad idea even with expensive cameras - just use the gradient of gray intensity. Some producers of high-end stereo cameras (for example Point Grey) used to rely on color and then switched to grey. Also, consider bias and variance as two components of the stereo matching error. This is important since using correlation stereo, for example, with a large correlation window averages depth (i.e. models the world as a bunch of fronto-parallel patches), which reduces the variance while increasing the bias, and vice versa. So there is always a trade-off.
More than the factors you mentioned above, the accuracy of your stereo will depend on the specifics of the algorithm. It is up to the algorithm to validate depth (an important step after stereo estimation) and gracefully patch the holes in textureless areas. For example, consider back-and-forth validation (matching R to L should produce the same candidates as matching L to R), blob noise removal (non-Gaussian noise typical for stereo matching, removed with a connected-component algorithm), texture validation (invalidate depth in areas with weak texture), uniqueness validation (having a uni-modal matching score without strong second and third candidates; this is typically a shortcut to back-and-forth validation), etc. The accuracy will also depend on sensor noise and the sensor's dynamic range.
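As a concrete illustration of the back-and-forth (left-right) validation just mentioned, here is a small NumPy sketch; it assumes you already have both a left and a right disparity map, and the 1-pixel tolerance is an arbitrary choice:

    import numpy as np

    def left_right_check(disp_left, disp_right, tol=1.0):
        # disp_left[y, x] is the disparity of left pixel (x, y); its match in the
        # right image sits at column x - d. Keep the pixel only if the disparity
        # found there, matched back, agrees to within `tol` pixels (assumed value).
        h, w = disp_left.shape
        ys, xs = np.mgrid[0:h, 0:w]
        xr = np.round(xs - disp_left).astype(int)         # matched column in the right image
        in_bounds = (xr >= 0) & (xr < w) & (disp_left > 0)
        d_right = disp_right[ys, np.clip(xr, 0, w - 1)]   # disparity at the matched pixel
        return in_bounds & (np.abs(disp_left - d_right) <= tol)

Pixels failing the check are typically invalidated (set to zero disparity) rather than kept with a wrong depth.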
Finally, you have to frame your question about accuracy as a function of depth, since d = f*B/z, where B is the baseline between cameras, f is the focal length in pixels, and z is the distance along the optical axis. Thus there is a strong dependence of accuracy on the baseline and the distance.
Kinect will provide 1 mm accuracy (bias) with quite a large variance up to 1 m or so; then accuracy sharply goes down. Kinect also has a dead zone up to 50 cm, since there is insufficient overlap of the two cameras at close distance. And yes - Kinect is a stereo camera where one of the cameras is simulated by an IR projector.
I am sure with probabilistic stereo such as Belief Propagation on Markov Random Fields one can achieve a higher accuracy. But those methods assume some strong priors about smoothness of object surfaces or particular surface orientation. See this for example, page 14.
If you want to know a bit more about the accuracy of the approaches, take a look at this site; although it is no longer very active, the results are pretty much state of the art. Take into account that a couple of the papers presented there went on to create companies. What do you mean by a real stereo vision system? If you mean commercial, there aren't many; most commercial reconstruction systems work with structured light or scanners directly. This is because (and you missed one important factor in your list) texture is a key factor for accuracy (or, even before that, correctness); a white wall cannot be reconstructed by a stereo system unless texture or structured light is added. Nevertheless, in my own experience, systems that involve variational matching can be very accurate (subpixel accuracy in image space), which is generally not achieved by probabilistic approaches. One last remark: the distance between cameras is also important for accuracy. Very close cameras will find a lot of correct matches quickly, but the accuracy will be low; more distant cameras will find fewer matches and will probably take longer, but the results could be more accurate; there is an optimal conic region defined in many books.
After all this, I can tell you that using OpenCV, one of the best things you can do is an initial camera calibration, then use Brox's optical flow to find matches and reconstruct.
