Kinect V2 depth in metres - Processing 3.x

I am using Processing 3.3.6 with the Open Kinect library (link below) and a Kinect V2 sensor. Following the examples at that link, I read the depth values from a depth[] array.
Openkinect Library for Processing
The page linked above gives this formula for converting a raw depth value to a real-world depth in meters:
depthInMeters = 1.0 / (rawDepth * -0.0030711016 + 3.3309495161);
This is adapted from here: Depth in meters calculation
The raw values I get range from 0 to 4500. After applying the formula from the references above, the converted values are not accurate; they are off by about 70 m. Is there another way to convert the depth to meters? Do I have to use the official development environment, such as Visual Studio (C++, C#) with the SDK, to calculate depth?
Or can open source tools like Processing capture these values with a different approach? Any help or guidance would be appreciated, as this is a completely new area for me.
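One possible explanation, offered as an assumption rather than a confirmed answer: the formula above was fitted to the 11-bit raw values (0 to 2047) of the first-generation Kinect, whereas a 0 to 4500 range looks like Kinect V2 output, which is commonly reported as depth in millimetres. Under that assumption the conversion is a plain division by 1000. The sketch below is in Python for illustration only (the function names are hypothetical) and also shows how the old formula misbehaves on such values.

```python
def kinect_v1_raw_to_meters(raw_depth):
    """Formula quoted in the question; fitted to Kinect V1 11-bit raw values."""
    return 1.0 / (raw_depth * -0.0030711016 + 3.3309495161)


def kinect_v2_raw_to_meters(raw_depth_mm):
    """Assumption: Kinect V2 raw depth is already in millimetres; 0 means no reading."""
    if raw_depth_mm <= 0:
        return None
    return raw_depth_mm / 1000.0


# Compare both conversions on values inside the 0 - 4500 range
for raw in (500, 1084, 1500, 4500):
    print(raw, kinect_v1_raw_to_meters(raw), kinect_v2_raw_to_meters(raw))
```

Near a raw value of about 1084 the denominator of the old formula approaches zero, so the result jumps to hundreds of meters and then turns negative, which would be consistent with the large errors described above.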

Related

FMCW radar: understanding of doppler fft

I am using an FMCW radar to measure the distance and speed of moving objects with an STM32L476 microcontroller. I transmit a sawtooth modulation signal and read the received signal in digital form using the ADC. I then copy the received ADC data into the fft_in array, converting it to float32_t (fft_in array size = 512). I apply an FFT to this array and process the result to find the range of the object. Up to this point everything works fine.
Now, to find the velocity of the object, I first copy these arrays (fft_in) as rows of a matrix for 64 chirps (matrix size [64][512]). Then I take the peak range bin column and apply an FFT to this column array. After the FFT, its length is reduced to half (32 elements). The peak bin multiplied by the frequency resolution then gives the phase difference 'w', from which the velocity can be calculated as "v = λω / (4π Tc)".
When running this algorithm, I find that when the object is stationary, I get the peak value at the 22nd element (out of 32 elements). What does this imply?
The sampling frequency of the ADC is 24502 Hz, so the per-bin value for range estimation is 47.8555 Hz (24502/512).
I have 64 chirps and Tc is 0.006325 s, so 1/0.006325 gives 158.10 Hz. What would the resolution per velocity bin be? Is it 2.47 Hz (158.10/64)? I am a bit confused about this concept. How does the second FFT work for finding velocity in FMCW radar?
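The numbers in the question can be checked with a short calculation. This sketch uses the standard FMCW relations: a Doppler bin is 1/(N·Tc) wide, and a Doppler frequency f_d corresponds to a velocity v = λ·f_d/2 (the same as v = λω/(4π Tc) with ω = 2π·f_d·Tc). The wavelength is an assumption (a 24 GHz carrier), since the question does not state the carrier frequency.

```python
# Values taken from the question
fs = 24502.0          # ADC sampling frequency in Hz
n_samples = 512       # samples per chirp (range FFT length)
n_chirps = 64         # chirps per frame (Doppler FFT length)
tc = 0.006325         # chirp repetition time in s

# Assumption: 24 GHz carrier, not stated in the question
wavelength = 3.0e8 / 24.0e9   # metres

range_bin_hz = fs / n_samples                       # width of one range FFT bin
chirp_rate_hz = 1.0 / tc                            # slow-time sampling rate
doppler_bin_hz = chirp_rate_hz / n_chirps           # width of one Doppler bin
velocity_bin = wavelength * doppler_bin_hz / 2.0    # m/s per Doppler bin
max_velocity = wavelength * chirp_rate_hz / 4.0     # +/- unambiguous velocity

print(f"range bin:    {range_bin_hz:.4f} Hz")
print(f"Doppler bin:  {doppler_bin_hz:.4f} Hz")
print(f"velocity bin: {velocity_bin:.5f} m/s (for the assumed wavelength)")
print(f"max velocity: +/- {max_velocity:.3f} m/s (for the assumed wavelength)")
```

So 2.47 Hz per Doppler bin is consistent with the figures given. Assuming the second FFT is a complex FFT over the complex range-FFT outputs of all 64 chirps, it keeps all 64 bins (the upper half represents negative velocities), and a stationary target would be expected in bin 0.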

Infineon has excellent resources on this topic, see this FAQ for the basics: https://www.infineon.com/dgdl/Infineon-Radar%20FAQ-PI-v02_00-EN.pdf?fileId=5546d46266f85d6301671c76d2a00614
If you want to know more details, check out the P2G Software User Manual:
https://www.infineon.com/dgdl/Infineon-P2G_Software_User_Manual-UserManual-v01_01-EN.pdf?fileId=5546d4627762291e017769040a233324 (Chapter 4)
The software with all the algorithms (including FMCW) is also available. How to get it with the "Infineon Toolbox" is described here: https://www.mouser.com/pdfdocs/Infineon_Position2Go_QS.pdf
Some hints from me:
I suggest removing the mean and applying a window function before the FFT: https://en.wikipedia.org/wiki/Window_function
Read about frequency mixers https://en.wikipedia.org/wiki/Frequency_mixer
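The first hint can be sketched as follows. This is a minimal NumPy illustration on synthetic data, not the Infineon implementation: it removes the mean of each chirp, applies a Hann window, and runs the range FFT followed by a Doppler FFT across chirps. The array shape follows the question (64 chirps of 512 samples); the test signal and the function name are made up.

```python
import numpy as np


def range_doppler_map(frame):
    """frame: real ADC samples, shape (n_chirps, n_samples)."""
    n_chirps, n_samples = frame.shape
    # Remove the DC offset of each chirp, then window along fast time
    frame = frame - frame.mean(axis=1, keepdims=True)
    frame = frame * np.hanning(n_samples)
    # Range FFT: real input, keep the complex one-sided spectrum
    range_fft = np.fft.rfft(frame, axis=1)
    # Window along slow time, then run a complex FFT over the chirps
    range_fft = range_fft * np.hanning(n_chirps)[:, None]
    doppler_fft = np.fft.fft(range_fft, axis=0)
    # Shift so that zero velocity sits in the middle row
    return np.fft.fftshift(doppler_fft, axes=0)


# Synthetic stationary target: the same beat tone in every chirp, plus a DC offset
n_chirps, n_samples = 64, 512
t = np.arange(n_samples)
chirp = 2.0 + np.cos(2 * np.pi * 40 * t / n_samples)
frame = np.tile(chirp, (n_chirps, 1))

rd = np.abs(range_doppler_map(frame))
doppler_bin, range_bin = np.unravel_index(rd.argmax(), rd.shape)
print(doppler_bin - n_chirps // 2, range_bin)
```

For this synthetic stationary target the peak lands at Doppler offset 0 and range bin 40. The Doppler FFT here operates on the complex range spectrum, so all 64 Doppler bins carry information.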

How to calculate radius in orb?

While studying ORB feature descriptors in the official paper, I found this statement:
We empirically choose r to be the patch size, so that x and y run from [−r, r]. As |C| approaches 0,
I did not understand how r is calculated; please tell me how to calculate r.
I searched the internet extensively but couldn't find a formula or an explanation, and I do not understand what the statement means.
Would you please explain it to me, and give me the formula if there is one?

The paper says:
"We empirically choose r to be the patch size,..."
OpenCV uses a patch size of 31 as its default value.
The intensity patches of the specified size are used to describe a FAST keypoint. Since ORB uses the BRIEF descriptor, an image patch is converted into a binary string, which is later compared to match the keypoints. More details are found in the BRIEF paper.
So if you increase r, you enlarge the image region that the binary tests sample from; the length of the binary string is set by the number of tests, not by the patch size.
So the radius is not calculated by some formula but is instead chosen by the developers or the user.
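To make the role of r concrete, here is a minimal NumPy sketch of the intensity-centroid orientation described in the ORB paper, where the moments are summed over offsets x, y in [−r, r] inside a circular region. It is an illustration only, not OpenCV's implementation; the function name and the test image are made up, and r = 15 is an assumption corresponding to half of a 31-pixel patch.

```python
import numpy as np


def intensity_centroid_angle(image, cx, cy, r):
    """Orientation of the patch centred at (cx, cy), using offsets in [-r, r].

    The caller must keep the patch inside the image bounds.
    """
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    inside = xs ** 2 + ys ** 2 <= r ** 2          # circular region of radius r
    patch = image[cy - r:cy + r + 1, cx - r:cx + r + 1].astype(np.float64)
    m10 = np.sum(xs * patch * inside)             # first-order moment in x
    m01 = np.sum(ys * patch * inside)             # first-order moment in y
    return float(np.arctan2(m01, m10))


# Test image: brightness increases to the right, so the centroid lies
# to the right of the patch centre and the angle is close to 0.
image = np.tile(np.arange(64, dtype=np.float64), (64, 1))
angle_right = intensity_centroid_angle(image, cx=32, cy=32, r=15)

# Transposed image: brightness increases downwards, angle close to pi/2.
angle_down = intensity_centroid_angle(image.T, cx=32, cy=32, r=15)
print(angle_right, angle_down)
```

Changing r only changes how many pixels contribute to the moments; there is no formula that derives r from the image.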

OpenCV: PNP pose estimation fails in a specific case

I am using OpenCV's solvePnPRansac function to estimate the pose of my camera given a pointcloud made from tracked features. My pipeline consists of multiple cameras where I form the point cloud from matched features between two cameras, and use that as a reference to estimate the pose of one of the cameras as it starts moving. I have tested this in multiple settings and it works as long as there are enough features to track while the camera is in motion.
Strangely, during a test I did today, I encountered a failure case where solvePnP would just return junk values all the time. What's confusing here is that in this data set, my point cloud is much denser, it's reconstructed pretty accurately from the two views, the tracked number of points (currently visible features vs. features in the point cloud) at any given time was much higher than what I usually have, so theoretically it should have been a breeze for solvePnP, yet it fails terribly.
I tried with CV_ITERATIVE, CV_EPNP and even the non RANSAC version of solvePnP. I was just wondering if I am missing something basic here? The scene I am looking at can be seen in these images (image 1 is the scene and feature matches between two perspectives, image 2 is the point cloud for reference)
The part of the code doing PnP is pretty simple. If P3D is the array of tracked 3D points and P2D is the corresponding set of image points:
solvePnPRansac(P3D, P2D, K, d, R, T, false, 500, 2.0, 100, noArray(), CV_ITERATIVE);
EDIT: I should also mention that my reference point cloud was obtained with a baseline of 8 feet between the cameras, whereas the building I am looking at was probably about 100 feet away. Could the possible lack of disparity cause issues as well?
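On the baseline question in the edit, a rough estimate may help. For a rectified stereo pair, depth is Z = f·B/d, so a disparity error δd maps to a depth error of roughly Z²/(f·B)·δd. The sketch below plugs in the 8 ft baseline and 100 ft distance from the question; the focal length (in pixels) and the half-pixel matching error are assumptions chosen only for illustration, and the function name is made up.

```python
FEET_TO_M = 0.3048


def depth_error(z, baseline, focal_px, disparity_err_px):
    """First-order depth uncertainty for rectified stereo: Z^2 / (f * B) * err."""
    return z ** 2 / (focal_px * baseline) * disparity_err_px


baseline = 8 * FEET_TO_M      # from the question
distance = 100 * FEET_TO_M    # from the question
focal_px = 800.0              # assumption: focal length in pixels
disparity_err_px = 0.5        # assumption: feature matching error in pixels

disparity_px = focal_px * baseline / distance
err = depth_error(distance, baseline, focal_px, disparity_err_px)
print(f"disparity: {disparity_px:.1f} px, depth error: +/- {err:.2f} m")
```

With these assumed numbers the depth uncertainty is well under one percent of the distance, so the baseline alone may not explain the failure. The estimate scales linearly with the matching error and inversely with the focal length, so it should be redone with the real calibration.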

Determine skeleton joints with a webcam (not Kinect)

I'm trying to determine skeleton joints (or at the very least to be able to track a single palm) using a regular webcam. I've looked all over the web and can't seem to find a way to do so.
Every example I've found is using Kinect. I want to use a single webcam.
There's no need for me to calculate the depth of the joints - I just need to be able to recognize their X, Y position in the frame. Which is why I'm using a webcam, not a Kinect.
So far I've looked at:
OpenCV (the "skeleton" functionality in it is a process of simplifying graphical models, but it's not a detection and/or skeletonization of a human body).
OpenNI (with NiTE) - the only way to get the joints is to use the Kinect device, so this doesn't work with a webcam.
I'm looking for a C/C++ library (but at this point would look at any other language), preferably open source (but, again, will consider any license) that can do the following:
Given an image (a frame from a webcam) calculate the X, Y positions of the visible joints
[Optional] Given a video capture stream call back into my code with events for joints' positions
Doesn't have to be super accurate, but would prefer it to be very fast (sub-0.1 sec processing time per frame)
Would really appreciate it if someone can help me out with this. I've been stuck on this for a few days now with no clear path to proceed.
UPDATE
2 years later a solution was found: http://dlib.net/imaging.html#shape_predictor

To track a hand using a single camera without depth information is a serious task and topic of ongoing scientific work. I can supply you a bunch of interesting and/or highly cited scientific papers on the topic:
M. de La Gorce, D. J. Fleet, and N. Paragios, “Model-Based 3D Hand Pose Estimation from Monocular Video.,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, Feb. 2011.
R. Wang and J. Popović, “Real-time hand-tracking with a color glove,” ACM Transactions on Graphics (TOG), 2009.
B. Stenger, A. Thayananthan, P. H. S. Torr, and R. Cipolla, “Model-based hand tracking using a hierarchical Bayesian filter.,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 9, pp. 1372–84, Sep. 2006.
J. M. Rehg and T. Kanade, “Model-based tracking of self-occluding articulated objects,” in Proceedings of IEEE International Conference on Computer Vision, 1995, pp. 612–617.
Hand tracking literature survey in the 2nd chapter:
T. de Campos, “3D Visual Tracking of Articulated Objects and Hands,” 2006.
Unfortunately, I don't know of any freely available hand tracking library.

There is a simple way to detect a hand using skin tone; perhaps this could help. You can see the results in this YouTube video. Caveat: the background shouldn't contain skin-colored things like wood.
Here is the code:
''' Detect human skin tone and draw a boundary around it.
Useful for gesture recognition and motion tracking.
Inspired by: http://stackoverflow.com/a/14756351/1463143
Date: 08 June 2013
'''
# Required modules
import cv2
import numpy
# Constants for finding range of skin color in YCrCb
min_YCrCb = numpy.array([0,133,77],numpy.uint8)
max_YCrCb = numpy.array([255,173,127],numpy.uint8)
# Create a window to display the camera feed
cv2.namedWindow('Camera Output')
# Get pointer to video frames from primary device
videoFrame = cv2.VideoCapture(0)
# Process the video frames
keyPressed = -1 # -1 indicates no key pressed
while(keyPressed < 0): # any key pressed has a value >= 0
    # Grab video frame, decode it and return next video frame
    readSuccess, sourceImage = videoFrame.read()
    if not readSuccess: # stop if the camera returned no frame
        break
    # Convert image to YCrCb
    imageYCrCb = cv2.cvtColor(sourceImage,cv2.COLOR_BGR2YCR_CB)
    # Find region with skin tone in YCrCb image
    skinRegion = cv2.inRange(imageYCrCb,min_YCrCb,max_YCrCb)
    # Do contour detection on skin region
    # (findContours returns 2 or 3 values depending on the OpenCV version;
    # the last two are always contours and hierarchy)
    contours, hierarchy = cv2.findContours(skinRegion, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]
    # Draw the contour on the source image
    for i, c in enumerate(contours):
        area = cv2.contourArea(c)
        if area > 1000:
            cv2.drawContours(sourceImage, contours, i, (0, 255, 0), 3)
    # Display the source image
    cv2.imshow('Camera Output',sourceImage)
    # Check for user input to close program
    keyPressed = cv2.waitKey(1) # wait 1 millisecond in each iteration of while loop
# Close window and camera after exiting the while loop
cv2.destroyWindow('Camera Output')
videoFrame.release()
cv2.findContours is quite useful: after you find the contours, you can find the centroid of a "blob" using cv2.moments. Have a look at the OpenCV documentation on shape descriptors.
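As a rough illustration of the centroid computation, the sketch below evaluates the raw moments m00, m10 and m01 of a binary mask directly in NumPy and derives the centroid as (m10/m00, m01/m00). cv2.moments returns values under the same names ('m00', 'm10', 'm01'); the helper name and the test mask here are made up.

```python
import numpy as np


def mask_centroid(mask):
    """Centroid (x, y) of the non-zero pixels of a 2-D mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    m00 = xs.size                # zeroth moment: area in pixels
    if m00 == 0:
        return None
    m10 = float(xs.sum())        # first moment in x
    m01 = float(ys.sum())        # first moment in y
    return m10 / m00, m01 / m00


mask = np.zeros((100, 100), dtype=np.uint8)
mask[20:41, 30:71] = 255         # filled rectangle: rows 20..40, columns 30..70
print(mask_centroid(mask))       # centre of the rectangle: (50.0, 30.0)
```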
I haven't yet figured out how to make the skeletons that lie in the middle of the contour, but I was thinking of "eroding" the contours until each is a single line. In image processing this process is called "skeletonization" or "morphological skeleton". Here is some basic info on skeletonization.
Here is a link that implements skeletonization in OpenCV and C++
Here is a link for skeletonization in OpenCV and Python
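The "erode until a single line remains" idea corresponds to the classical morphological skeleton, which collects, at every erosion step, the pixels that an opening would remove. Below is a minimal NumPy sketch with a 3×3 cross structuring element; it is written from the textbook definition and is not the code behind the links above. The function names and the test mask are made up.

```python
import numpy as np


def erode(mask):
    """Binary erosion with a 3x3 cross; pixels outside the image count as background."""
    p = np.pad(mask, 1, constant_values=False)
    return p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]


def dilate(mask):
    """Binary dilation with a 3x3 cross."""
    p = np.pad(mask, 1, constant_values=False)
    return p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]


def morphological_skeleton(mask):
    """Union over all erosion levels of (eroded shape minus its opening)."""
    mask = mask.astype(bool)
    skeleton = np.zeros_like(mask)
    while mask.any():
        eroded = erode(mask)
        opened = dilate(eroded)          # opening of the current shape
        skeleton |= mask & ~opened       # pixels the opening cannot restore
        mask = eroded
    return skeleton


# A filled 7 x 15 bar: the skeleton contains its centre line
bar = np.ones((7, 15), dtype=bool)
skel = morphological_skeleton(bar)
print(skel.astype(int))
```

This skeleton is not guaranteed to be connected or one pixel wide, so thinning algorithms are often preferred when a clean stick figure is needed.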
Hope that helps :)
--- EDIT ----
I would highly recommend that you go through these papers by Deva Ramanan (scroll down after visiting the linked page): http://www.ics.uci.edu/~dramanan/
C. Desai, D. Ramanan. "Detecting Actions, Poses, and Objects with Relational Phraselets" European Conference on Computer Vision (ECCV), Florence, Italy, Oct. 2012.
D. Park, D. Ramanan. "N-Best Maximal Decoders for Part Models" International Conference on Computer Vision (ICCV) Barcelona, Spain, November 2011.
D. Ramanan. "Learning to Parse Images of Articulated Objects" Neural Info. Proc. Systems (NIPS), Vancouver, Canada, Dec 2006.

The most common approach can be seen in the following youtube video. http://www.youtube.com/watch?v=xML2S6bvMwI
This method is not very robust, as it tends to fail if the hand is rotated too much (e.g., if the camera is looking at the side of the hand or at a partially bent hand).
If you do not mind using two cameras, you can look into the work of Robert Wang. His current company (3GearSystems) uses this technology, augmented with a Kinect, to provide tracking. His original paper uses two webcams but has much worse tracking.
Wang, Robert, Sylvain Paris, and Jovan Popović. "6d hands: markerless hand-tracking for computer aided design." Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, 2011.
Another option (again, if using more than a single webcam is possible) is to use an IR emitter. Your hand reflects IR light quite well, whereas the background does not. By adding a filter to the webcam that blocks visible light (and removing the standard filter that does the opposite), you can create quite effective hand tracking. The advantage of this method is that segmenting the hand from the background is much simpler. Depending on the distance and the quality of the camera, you may need more IR LEDs in order to reflect sufficient light back into the webcam. The Leap Motion uses this technology to track fingers and palms (it uses 2 IR cameras and 3 IR LEDs to also get depth information).
All that being said, I think the Kinect is your best option here. Yes, you don't need the depth, but the depth information makes it a lot easier to detect the hand (by using it for the segmentation).

My suggestion, given your constraints, would be to use something like this:
http://docs.opencv.org/doc/tutorials/objdetect/cascade_classifier/cascade_classifier.html
Here is a tutorial for using it for face detection:
http://opencv.willowgarage.com/wiki/FaceDetection?highlight=%28facial%29|%28recognition%29
The problem you have described is quite difficult, and I'm not sure that trying to do it using only a webcam is a reasonable plan, but this is probably your best bet. As explained here (http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html?highlight=load#cascadeclassifier-load), you will need to train the classifier with something like this:
http://docs.opencv.org/doc/user_guide/ug_traincascade.html
Remember: Even though you don't require the depth information for your use, having this information makes it easier for the library to identify a hand.

At last I've found a solution. It turns out that dlib, an open-source project, has a "shape predictor" that, once properly trained, does exactly what I need: it estimates the "pose" with pretty satisfactory accuracy. A "pose" is loosely defined as "whatever you train it to recognize as a pose": you train it with a set of images annotated with the shapes to extract from them.
The shape predictor is described here on dlib's website.

I don't know of any existing solutions. If supervised (or semi-supervised) learning is an option, training decision trees or neural networks might already be enough (the Kinect uses random forests, from what I have heard). Before you go down that path, do everything you can to find an existing solution. Getting machine learning right takes a lot of time and experimentation.
OpenCV has machine learning components; what you would need is training data.

With the motion tracking features of the open source Blender project, it is possible to create a 3D model based on 2D footage. No Kinect needed. Since Blender is open source, you might be able to use its Python scripts outside the Blender framework for your own purposes.

Have you heard of EyesWeb?
I have been using it for one of my projects, and I thought it might be useful for what you want to achieve.
Here are some interesting publications: LNAI 3881 - Finger Tracking Methods Using EyesWeb and Powerpointing-HCI using gestures
Basically the workflow is:
Create your patch in EyesWeb
Prepare the data you want to send with a network client
Use the processed data on your own server (your app)
However, I don't know if there is a way to embed the real-time image processing part of EyesWeb into an application as a library.

OpenCV Multilevel B-Spline Approximation

Hi (sorry for my English). I'm working on a university project in which I need to use the MBA (Multilevel B-Spline Approximation) algorithm to obtain some points (control points) of an image for use in other operations.
I have read a lot of papers about this algorithm, and I think I understand it, but I can't write it.
The idea is: read an image, process it (OpenCV), then get the control points of the image and use those points.
So the problem here is:
The algorithm uses a set of points {(x,y,z)}, which is approximated by a surface generated from the control points obtained from MBA. The set of points {(x,y,z)} represents the data we need to approximate (the image).
The image is in cv::Mat format. How can I transform it into an ordinary array, so that I can simply access and manipulate the data?
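For illustration, here is one way the pixel data can be turned into a point set {(x, y, z)}, with z taken as the grey value. The sketch is in Python with NumPy, where OpenCV images are already plain arrays; in C++ the analogous step would read the cv::Mat elements into your own container. The function name and the tiny test image are made up.

```python
import numpy as np


def image_to_points(gray):
    """Return an (N, 3) array of (x, y, z) rows, z being the pixel intensity."""
    height, width = gray.shape
    ys, xs = np.mgrid[0:height, 0:width]
    points = np.column_stack((xs.ravel(), ys.ravel(), gray.ravel()))
    return points.astype(np.float64)


# Tiny 2 x 3 grey image standing in for a real photo
gray = np.array([[10, 20, 30],
                 [40, 50, 60]], dtype=np.uint8)
points = image_to_points(gray)
print(points)
```

Each row is one sample of the surface to approximate; a subsampled or masked version of the image could be fed in the same way if the full pixel grid is too dense.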
Here are some papers explaining the method:
(Paper) REGULARIZED MULTILEVEL B-SPLINE REGISTRATION
(Paper)Scattered Data Interpolation with Multilevel B-splines
(Matlab)MBA
If someone can help, a guideline, an idea, or anything else would be appreciated.
Thanks in advance.
EDIT: In the end I wrote the algorithm in C++ using Armadillo and OpenCV.

Well, I'm using Armadillo, a C++ linear algebra library, to work with matrices for the algorithm.
