TensorFlow Object Detection: Detect hands - image-processing

I am working on a Gesture Recognition project (end goal: identify the static/dynamic hand gesture in view). I intend to use neural networks (Python 3, Keras with a TensorFlow backend, in a Jupyter Notebook).
As the first step, to detect hands in images, I followed the Object Detection Tutorial and did everything it described. I could run the tutorial code successfully on my machine; however, it does not detect hands. I have seen many posts online, and I know that hand detection is possible by following the same tutorial. Moreover, it is mentioned that in the COCO dataset one of the classification categories is "hand" (table #2, second column, fifth row of Microsoft COCO: Common Objects in Context).
I don't know what has to be changed; please guide me regarding this. I did view a few online articles, such as this one, but found it difficult to follow. I want to find the location of only a hand in any image, and am not bothered about other objects.
I am aware of "Tracking Custom Objects Intro - Tensorflow Object Detection API Tutorial", but is the entire process required for hands? Isn't the model already trained to detect hands? (as mentioned in this paper)

You need to follow the entire process described in the articles you listed for any custom object detection. If you have images of hands, it's just a few days of work. Follow the video series you listed; it's the best of them all:
https://pythonprogramming.net/training-custom-objects-tensorflow-object-detection-api-tutorial/
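Once you have trained and exported a hand detector that way, running it on a single image looks roughly like the following. This is only a minimal sketch, assuming TensorFlow 1.x (as in the tutorial) and a frozen graph exported by the Object Detection API; the graph path and image name are placeholders.

    import numpy as np
    import tensorflow as tf
    from PIL import Image

    # Placeholder path: point this at the frozen graph you exported.
    PATH_TO_GRAPH = 'hand_inference_graph/frozen_inference_graph.pb'

    detection_graph = tf.Graph()
    with detection_graph.as_default():
        graph_def = tf.GraphDef()
        with tf.gfile.GFile(PATH_TO_GRAPH, 'rb') as f:
            graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name='')

    with detection_graph.as_default(), tf.Session() as sess:
        # Batch of one image, shape (1, height, width, 3), dtype uint8.
        image = np.expand_dims(np.array(Image.open('test.jpg')), axis=0)
        boxes, scores = sess.run(
            [detection_graph.get_tensor_by_name('detection_boxes:0'),
             detection_graph.get_tensor_by_name('detection_scores:0')],
            feed_dict={detection_graph.get_tensor_by_name('image_tensor:0'): image})
        # With a single 'hand' class, just keep the confident boxes.
        print(boxes[0][scores[0] > 0.5])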

Related

Recognize specific images, not the objects in the images

I need to recognize specific images using the iPhone camera. My goal is to have a set of 20 images, such that when a print or other display of one of them is in front of the camera, the app recognizes that image.
I thought about using classifiers (CoreML), but I don't think it would give the intended result. For example, if I had a model that recognizes fruits and then showed it two different pictures of a banana, it would recognize them both as bananas, which is not what I want. I want my app to recognize specific images, regardless of their content.
The behavior I want is exactly what ARToolKit does (https://www.artoolkit.org/documentation/doku.php?id=3_Marker_Training:marker_nft_training), but I do not wish to use this library.
So my question is: are there any other libraries, or other ways, for me to recognize specific images from the camera on iOS (preferably in Swift)?
Since you are using images specific to your use case there isn't going to be an existing model that you can use. You'd have to create a model, train it, and then import it into CoreML. It's hard to provide specific advice since I know nothing about your images.
As far as libraries are concerned, check out this list and Swift-AI.
Swift-AI has a neural network that you might be able to train if you had enough images.
Most likely you will have to create the model in another language, such as Python, and then import it into your Xcode project.
Take a look at this question.
This blog post goes into some detail about how to train your own model for CoreML.
Keras is probably your best bet for building your model. Take a look at this tutorial.
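For the Keras-to-CoreML step, a minimal sketch, assuming an older coremltools release that still ships the built-in Keras converter, and a hypothetical trained model.h5:

    import coremltools

    # 'model.h5' is a hypothetical Keras model trained elsewhere.
    coreml_model = coremltools.converters.keras.convert(
        'model.h5',
        input_names='image',
        image_input_names='image',              # expose the input as an image in Xcode
        class_labels=['image_01', 'image_02'])  # one label per known image
    coreml_model.save('ImageRecognizer.mlmodel')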
There are other problems too, though. You only have 20 images, which is certainly not enough to train an accurate model. Also, the user can present modified versions of these images. You'd have to generate realistic samples of each possible image and then use that entire set to train the model. I'd say you need a minimum of 20 variations of each image (400 total).
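One way to generate those variations is Keras's built-in image augmentation. A rough sketch, assuming your 20 originals are already loaded into a numpy array (the file name is a placeholder):

    import numpy as np
    from keras.preprocessing.image import ImageDataGenerator

    # Hypothetical array of your 20 source images, shape (20, height, width, 3).
    originals = np.load('originals.npy')

    # Random rotations, shifts, zooms and brightness changes approximate
    # the "modified versions" a user might present to the camera.
    datagen = ImageDataGenerator(rotation_range=20, width_shift_range=0.1,
                                 height_shift_range=0.1, zoom_range=0.15,
                                 brightness_range=(0.7, 1.3))

    # Draw distorted batches until we have roughly 20 variants per original.
    variants = []
    for batch in datagen.flow(originals, batch_size=20):
        variants.append(batch)
        if len(variants) >= 20:
            break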
You'll want to pre-process the image and extract features that you can compare to the known features of your images; this is how facial recognition works. Here is a guide to facial recognition that might help you with feature extraction.
Simply put, without a model that is based on your images, you can't do much.
Answering my own question.
I ended up following this awesome tutorial, which uses OpenCV to recognize specific images, and teaches how to make a wrapper so the code can be accessed from Swift.
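For anyone landing here: the core of that OpenCV approach is keypoint matching rather than classification. A rough Python equivalent of the idea (the C++ version is what gets wrapped for Swift; file names here are placeholders):

    import cv2

    orb = cv2.ORB_create()

    # Precompute descriptors for each of the 20 reference images.
    reference = cv2.imread('reference_01.png', cv2.IMREAD_GRAYSCALE)
    ref_kp, ref_des = orb.detectAndCompute(reference, None)

    frame = cv2.imread('camera_frame.png', cv2.IMREAD_GRAYSCALE)
    frame_kp, frame_des = orb.detectAndCompute(frame, None)

    # Hamming distance is the right metric for ORB's binary descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(ref_des, frame_des)

    # Many good (low-distance) matches => this reference image is in the frame.
    good = [m for m in matches if m.distance < 40]
    print('matches: %d' % len(good))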

Recognize "generic" objects

I'm working on a project for visually impaired people that converts the visual world to audio.
We prefer to create a prototype that doesn't need an internet connection, so we chose to work with OpenCV. After reading (a lot of) tutorials and documentation, we were able to train OpenCV to recognize specific objects.
For example: we trained OpenCV to recognize a certain chair and a door. That works fine.
But we also tried to train OpenCV on a "generic" level: it should be possible to recognize (almost) all chairs. We did that by training OpenCV with a lot of positive and negative images, as explained here: http://coding-robin.de/2013/07/22/train-your-own-opencv-haar-classifier.html
The actual result wasn't what we expected: it could not recognize any chair. I know there are a lot of different parameters to take into account (maybe we did something wrong there), and we experimented a lot. But our time (and, unfortunately, our knowledge of OpenCV) is limited.
We are looking for some advice on how to train OpenCV to recognize generic objects.
Where do we start?
Is OpenCV even suited to do that?
Thank you for your time!
OpenCV is the library to use, but object recognition is tricky. Often, when people say they are doing "object recognition", they are not; they are processing one image, or at best a series of related images, to separate it into object and background.
To recognise a "chair" - everything from an armchair to a dining chair to a throne - would be almost impossible. I'd want at least stereo images to give a chance to detect flat surfaces. I don't doubt that with a lot of work you can get quite a good result, maybe just recognising dining -style chairs, but it's skilled work, it's not just a case of feeding a few parameters to a hierarchical classifier.

Representing the image data for recognition

So I am working on a project for school, and what we are trying to do is teach a neural network to distinguish buildings from non-buildings. The problem I am having right now is representing the data in a form that is "readable" by the classifier function.
The training data is a set of pictures plus a .wkt file with the coordinates of the buildings in each picture. So far we have been able to rescale the polygons, but we got stuck there.
Can you give any hints or ideas of how to bring this all to an appropriate form?
Edit: I do not need the code written for me; a link to an article on a similar subject, or a book, is more the kind of thing I am looking for.
You did not mention which framework you are using, but I will give an answer for Caffe.
Your problem is very close to detecting objects within an image: you have full images with object (in your case, building) bounding boxes.
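Since your annotations are WKT polygons rather than boxes, a natural first step is converting each polygon to its axis-aligned bounding box. A minimal sketch, assuming one standard WKT polygon per line (already rescaled to pixel coordinates, as you described) and using the shapely library:

    import shapely.wkt

    # Hypothetical file: one WKT polygon per line, in pixel coordinates.
    with open('buildings.wkt') as f:
        polygons = [shapely.wkt.loads(line.strip()) for line in f if line.strip()]

    # A polygon's bounding box is its (minx, miny, maxx, maxy) envelope.
    for poly in polygons:
        minx, miny, maxx, maxy = poly.bounds
        print('x=%d y=%d w=%d h=%d' % (minx, miny, maxx - minx, maxy - miny))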
The easiest way of doing this is through a Python data layer, which reads an image and a file with the stored coordinates for that image, and feeds them into your network. A tutorial on how to use it can be found here: https://github.com/NVIDIA/DIGITS/tree/master/examples/python-layer
To accelerate the process, you may want to store the image/coordinate pairs in a custom LMDB database.
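For illustration, the bare LMDB write path looks like this. The key scheme and ad-hoc serialization are placeholders; in practice you would use a proper format such as Caffe's Datum:

    import lmdb

    env = lmdb.open('train_db', map_size=1 << 30)  # 1 GB upper bound
    with env.begin(write=True) as txn:
        with open('image_0001.jpg', 'rb') as f:
            img_bytes = f.read()
        boxes = b'12 34 56 78'  # placeholder "x y w h" coordinates
        # Store each image together with its box annotation under one key.
        txn.put(b'image_0001', img_bytes + b'|' + boxes)
    env.close()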
Finally, a good working example with a complete Caffe implementation can be found in the Faster R-CNN library here: https://github.com/rbgirshick/caffe-fast-rcnn/
You should check roi_pooling_layer.cpp in their custom Caffe branch, and roi_data_layer for how the data is fed into the network.

3D object tracking and detection using Kinect

I am working on identifying an object using the Kinect sensor, so as to get the x, y, z coordinates of the object.
I am trying to find related information on this but have not been able to find much. I have seen the videos as well, but nobody shares the information or any sample code.
This is what I want to achieve https://www.youtube.com/watch?v=nw3yix3XomY
Probably a few people have asked the same question, but as I am new to Kinect and these libraries, I need a little more guidance.
I read somewhere that object detection is not possible using Kinect v1 alone; we need to use third-party libraries like OpenCV or the Point Cloud Library (PCL).
Can somebody explain how, even using such third-party libraries, I can identify an object via a Kinect sensor?
It will be really helpful.
Thank you.
As the author of the video you linked stated in the comments, following this PCL tutorial will help you. As you have found out already, realizing this may not be possible using the standalone SDK; relying on PCL will help you avoid reinventing the wheel.
The idea there is to:
Downsample the cloud to have less data to deal with in the next steps (this also reduces noise a bit).
Identify keypoints/features (i.e. points, areas, or textures that remain somehow invariant under certain transformations).
Compute the keypoint descriptors, i.e. mathematical representations of these features.
For each scene keypoint descriptor, find the nearest neighbor in the model keypoint descriptor cloud and add it to the correspondences vector.
Perform clustering on the correspondences and detect the model in the scene (a rough code sketch of these steps follows).
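PCL itself is C++. To give a feel for the same pipeline in Python, here is a sketch of the downsample/describe/match steps using Open3D, a different library with analogous primitives; file names are placeholders and the final clustering step is omitted:

    import numpy as np
    import open3d as o3d
    from scipy.spatial import cKDTree

    def preprocess(pcd, voxel):
        # Step 1: downsample to reduce data (and some noise).
        down = pcd.voxel_down_sample(voxel_size=voxel)
        down.estimate_normals(
            o3d.geometry.KDTreeSearchParamHybrid(radius=voxel * 2, max_nn=30))
        # Steps 2-3: one 33-dim FPFH descriptor per downsampled point.
        fpfh = o3d.pipelines.registration.compute_fpfh_feature(
            down, o3d.geometry.KDTreeSearchParamHybrid(radius=voxel * 5, max_nn=100))
        return down, np.asarray(fpfh.data).T

    model = o3d.io.read_point_cloud('model.pcd')    # placeholder file names
    scene = o3d.io.read_point_cloud('scene.pcd')
    model_down, model_feat = preprocess(model, 0.01)
    scene_down, scene_feat = preprocess(scene, 0.01)

    # Step 4: nearest-neighbour matching of scene descriptors against the model's.
    dist, idx = cKDTree(model_feat).query(scene_feat, k=1)
    correspondences = [(i, int(j)) for i, (d, j) in enumerate(zip(dist, idx))
                       if d < 50.0]
    print('%d correspondences' % len(correspondences))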
The software in the tutorial requires the user to manually feed in the model and scene files; it doesn't work on a live feed like the video you linked.
The process should be pretty similar, though. I'm not sure how CPU-intensive the detection is, so it might require additional performance tweaking.
Once you have frame-by-frame detection in place, you could start thinking about actually tracking an object across the frames. But that's another topic.

OpenCV 2.2 image processing

I have to make an application which recognizes road signs. I saw that in the OpenCV folder there are some XML files for facial recognition, but I do not know what the numbers in those XML files represent or how those values were obtained. I need to understand this so that I can create my own XML files for road sign recognition.
I do not know much about OpenCV; however, I completed my final year project on face recognition using neural networks. Basically, I used an algorithm to extract the facial portion from a given image. Thereafter, I fed that new image (containing only the face) to a neural network that I developed using Matlab. After rigorous improvements it was a success, and by using the simulation feature of Matlab it was possible to precisely identify the individual.
Therefore I strongly recommend that you follow the same technique in carrying out this task.
I managed to find some interesting articles related to this topic: here, here, here and here.
What you need are two steps:
detection step
recognition step
For the detection step, I suggest you use the cascade classifier that is included with OpenCV; it's robust and quicker than the Haar trainer. In this step you train the classifier on the traffic signs to be detected. I found this tutorial that may help you prepare your training material.
In this step you detect your signs. The classifier may also detect some additional false objects in the image; you can eliminate these undesired objects with some extra processing (e.g. checking the aspect ratio or color), or even by adding more negative images.
For the recognition step, I suggest you use exactly OpenCV's tutorial dedicated to face recognition; here you don't need a lot of modification.
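To make the two steps concrete, a rough Python sketch that chains them: a hypothetical sign cascade for detection, then the LBPH recognizer from the face recognition tutorial (in the opencv-contrib package) retrained on sign crops instead of faces. All file names and labels are placeholders:

    import cv2
    import numpy as np

    # Detection: a cascade trained on signs ('sign_cascade.xml' is a placeholder).
    cascade = cv2.CascadeClassifier('sign_cascade.xml')

    # Recognition: LBPH, trained here on cropped sign images instead of faces.
    recognizer = cv2.face.LBPHFaceRecognizer_create()
    train_images = [cv2.resize(cv2.imread('stop_%d.png' % i, cv2.IMREAD_GRAYSCALE),
                               (64, 64)) for i in range(5)]
    train_labels = np.array([0] * 5)  # label 0 = "stop sign"
    recognizer.train(train_images, train_labels)

    gray = cv2.cvtColor(cv2.imread('street.jpg'), cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
        crop = cv2.resize(gray[y:y + h, x:x + w], (64, 64))
        label, confidence = recognizer.predict(crop)
        print('sign %d at (%d, %d), confidence %.1f' % (label, x, y, confidence))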
