I need to build an image classification model using TensorFlow, but my dataset has more than 10,000 classes and only 5 images per class.
I understand that 5 images per class is too few and that ideally there should be "at least" 100 images for each class, but given that, I don't understand how some "face recognition" models can work.
For instance, all modern smartphones provide a "face recognition" feature that can identify the phone's owner among all the faces in the world, and the setup is very easy: it just needs a quick shot (3 to 5 seconds) of the owner's face.
So why does this work, while image classification models require a large number of images to achieve acceptable accuracy?
Are these models built using a different technology behind the scenes?
Would it be possible to build an "image classification" model using the same technology that smartphones use for "face recognition"?
Smartphone face recognition: What your smartphone's face recognition system does is identify a set of key features, say S, on your face. Given a new face, it says either "Yes, this face matches S" or "No, this face does not match S". So all you need is a few samples of your face to identify a good set S. When it sees a new face, all it has to do is extract the same key features from that face, compare them with S, and answer "Yes" or "No". It does not have to say whether it is your face, your father's face, your mother's face, etc. All it has to say is "yes, it matches" or "no, it does not match".
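As a rough illustration of that yes/no matching, here is a minimal sketch. It assumes the key features are already available as short numeric vectors; the vectors, the averaging used to build S, the `enroll` and `matches` helpers, and the 0.5 threshold are all made up for the example and are not how any particular phone does it.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def enroll(samples):
    """Average a few feature vectors of the owner's face into one template S."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def matches(template, features, threshold=0.5):
    """Answer the yes/no question: is this face close enough to S?"""
    return euclidean(template, features) <= threshold

# Hypothetical 3-dimensional feature vectors; real systems use many more dimensions.
owner_samples = [[0.9, 0.1, 0.4], [1.0, 0.2, 0.5], [1.1, 0.0, 0.3]]
S = enroll(owner_samples)

owner_ok = matches(S, [1.0, 0.1, 0.45])     # a new shot of the owner
stranger_ok = matches(S, [0.1, 0.9, 0.8])   # somebody else
print(owner_ok, stranger_ok)
```

Note that nothing here is trained per person: enrolling a new owner only means computing a new S from a handful of samples.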
Image classification: Image classification, however, is a totally different task, in which each image has to be assigned to a class. To identify whether an image shows a cat, the system has to extract key features that distinguish cats from other animals. So if you have 100 different animals, you need 100 such sets of distinguishing key features. This is why you need a large number of samples for each class: so that the image classification system can identify such a key feature set for each class.
How you identify the key features is a totally different ball game. It can be done either with classical image processing techniques (like SIFT, SURF, etc.) or with deep learning techniques (like CNNs, autoencoders, etc.).
Related
I am currently using the Google Cloud Vision API for a project. I want to assign a unique ID to each face, so that it automatically detects which IDs any image contains. This way I can know which person is in the image.
Can Cloud Vision distinguish faces and return some unique ID for a face?
No. As Armin has already mentioned, the Google Vision API doesn't support facial recognition or face verification. It only performs face detection on an image. What you can do is use TensorFlow to accomplish what you want. Let me explain:
A typical face recognition system (pipeline) consists of a few phases:
1. Face detection: which you can do with the Google Vision API.
2. Facial feature extraction: which you can do by using TensorFlow to extract facial features and get a face embedding for each face detected in step 1. The features can be extracted with a pre-trained model trained on a large dataset (such as VGGFace2 or CASIA-WebFace).
3. Face recognition (identification or verification): which you can achieve by using
   - TensorFlow to read the face embeddings (computed and saved in step 2) from disk (they could also be fetched from a database, depending on where you saved them)
   - Support Vector Machines (SVM) in Python to do multi-class classification.
(IMO) The most important things in a face recognition system are detecting faces correctly and extracting facial features correctly. The third step is just a classification problem and can be done in many ways; for example, you can also use the Euclidean distance between face embeddings to decide whether two faces are similar (identification).
For the second and third steps you can take a look at FaceNet (https://github.com/davidsandberg/facenet), which is a great example of how you can develop your own facial recognition system based on TensorFlow.
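To make the third step a bit more concrete, here is a small sketch of the Euclidean-distance variant (not the SVM one). It assumes steps 1 and 2 have already produced embeddings; the 4-dimensional vectors, the person IDs, the `identify` helper and the 0.6 cut-off are invented for illustration, and real embeddings from a FaceNet-style model are much longer.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_vector(vectors):
    """Component-wise mean of several embeddings of the same person."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def identify(embedding, gallery, max_distance=0.6):
    """Return the ID of the closest enrolled person, or None if nobody is close enough."""
    best_id, best_dist = None, float("inf")
    for person_id, centroid in gallery.items():
        dist = euclidean(embedding, centroid)
        if dist < best_dist:
            best_id, best_dist = person_id, dist
    return best_id if best_dist <= max_distance else None

# Hypothetical embeddings saved in step 2, keyed by the unique ID you assigned.
saved = {
    "person_001": [[0.1, 0.9, 0.0, 0.2], [0.2, 0.8, 0.1, 0.2]],
    "person_002": [[0.9, 0.1, 0.7, 0.0], [0.8, 0.2, 0.8, 0.1]],
}
gallery = {pid: mean_vector(vectors) for pid, vectors in saved.items()}

known = identify([0.15, 0.85, 0.05, 0.25], gallery)   # close to person_001
unknown = identify([0.5, 0.5, 5.0, 5.0], gallery)     # far from everyone
print(known, unknown)
```

Returning None for faces that are far from every enrolled person is what lets you treat them as "new" and assign a fresh ID.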
The Vision API service offers a Face Detection feature that can be used to detect multiple faces within an image along with the associated key facial attributes such as emotional state or wearing headwear. Based on this, you can get the bounding polygon around the face, the landmarks, roll angle, detection confidence, among other properties; however, it is important to note that this feature doesn't support Facial Recognition, which means that it cannot be used to retrieve unique IDs for the faces detected.
In case this feature doesn't cover your current needs, you can use the Send Feedback button, located at the lower left and upper right corners of the service's public documentation, or take a look at the Issue Tracker tool in order to raise a Vision API feature request and notify Google of this desired functionality.
I'm looking for a way to detect humans in a picture. For instance, regarding the picture below, I'd like to coarsely determine how many people are in the scene. I must be able to detect both standing and sitting people. I do not mind not detecting people located behind a physical object (such as the glass in the bus picture).
AFAIK, such a problem can rather easily be solved by training deep neural networks. However, my coworkers would like me to also implement a detection technique based on general image processing techniques. I've spent several days looking for techniques designed by researchers but I couldn't find anything other than saliency-based techniques (which may be fine, but I'd like to test several techniques based on old-fashioned image processing).
I'd like to mention that I'm not new to the topic of image segmentation & I used to segment aortas in medical scans. However, this task was easier IMHO since scanners have similar features: in this use-case (human detection in a bus, for instance), the pictures will have very different characteristics (e.g. image contrast can strongly vary, whether it's been taken during the day or at night).
Long story short, I'd like to know if there's some segmentation technique for human detection that would be worth giving a shot, given that the image features vary a lot?
Is deep learning the only way to detect humans in a picture?
No. Is it the best way we know? Depends on your conditions.
The simplest way of doing detection is to generate lots of random bounding boxes and then solve a classification problem on each crop. Here is some Pythonic pseudo-code:
def detect_people(image):
    """
    Find all people in image.

    Parameters
    ----------
    image : image object

    Returns
    -------
    people : list of axis-aligned bounding boxes (aabb)
        Each bounding box contains a person
    """
    people = []
    for aabb in generate_random_aabb(image):
        crop = crop_image(image, aabb)
        if is_person(crop):
            # Keep the box itself, as documented above, not the cropped pixels
            people.append(aabb)
    return people
In this case is_person can be any classifier, e.g. boosted decision stumps as used in the Viola–Jones object detection framework. Speaking of which: That would likely be the way to go without DL, but is much more complicated to explain.
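If you want something you can actually run, here is a toy version of the pseudo-code above. Everything in it is a stand-in chosen for the example: the image is a small grid of 0/1 values, boxes are (top, left, bottom, right) tuples with exclusive bottom/right, and is_person just checks that the crop is entirely 1s. A real is_person would be a trained classifier, as described above.

```python
import random

def generate_random_aabb(image, n_boxes=500, min_size=2, rng=None):
    """Yield random boxes (top, left, bottom, right); bottom and right are exclusive."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    height, width = len(image), len(image[0])
    for _ in range(n_boxes):
        top = rng.randrange(0, height - min_size + 1)
        left = rng.randrange(0, width - min_size + 1)
        bottom = rng.randrange(top + min_size, height + 1)
        right = rng.randrange(left + min_size, width + 1)
        yield (top, left, bottom, right)

def crop_image(image, aabb):
    """Cut the box out of a list-of-rows image."""
    top, left, bottom, right = aabb
    return [row[left:right] for row in image[top:bottom]]

def is_person(crop):
    """Toy stand-in classifier: a crop counts as a 'person' if every pixel is 1."""
    return all(pixel == 1 for row in crop for pixel in row)

def detect_people(image, n_boxes=500):
    """Return every random box whose crop the classifier accepts."""
    return [aabb for aabb in generate_random_aabb(image, n_boxes)
            if is_person(crop_image(image, aabb))]

# A 6x8 image with one 3x3 "person" blob at rows 1..3, columns 2..4.
image = [[0] * 8 for _ in range(6)]
for r in range(1, 4):
    for c in range(2, 5):
        image[r][c] = 1

boxes = detect_people(image)
print(len(boxes), "accepted boxes")
```

Because the boxes are random, the same person can be hit by several overlapping boxes or, with too few boxes, by none at all; that is one reason this brute-force approach is only the simplest option.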
Object Detection vs Segmentation
Your question mixes both. Object detection gives you bounding boxes (coarse) for instances. Semantic segmentation labels all pixels by classes, but does not distinguish different instances of the same class (e.g. different people). Instance segmentation is like object detection, but is fine-grained and aims for pixel-exact results.
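To see how the three outputs relate, here is a tiny sketch on a hand-made label grid. The grid plays the role of an instance segmentation result (0 is background, 1 and 2 are two different people); the function names and the data are made up for the example.

```python
def semantic_mask(instance_map):
    """Semantic segmentation view: 1 for 'person' pixels, 0 for background.
    Which person a pixel belongs to is lost."""
    return [[1 if label > 0 else 0 for label in row] for row in instance_map]

def bounding_boxes(instance_map):
    """Object detection view: one coarse box per instance,
    as (top, left, bottom, right) with inclusive coordinates."""
    boxes = {}
    for r, row in enumerate(instance_map):
        for c, label in enumerate(row):
            if label == 0:
                continue
            top, left, bottom, right = boxes.get(label, (r, c, r, c))
            boxes[label] = (min(top, r), min(left, c), max(bottom, r), max(right, c))
    return boxes

# Instance segmentation view: every pixel knows its class AND its instance.
instance_map = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 0],
    [0, 0, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 0],
]

print(semantic_mask(instance_map))
print(bounding_boxes(instance_map))
```

For coarsely counting people, the number of boxes (or instances) is what you need; a semantic mask alone cannot tell you how many people there are.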
If you are interested in segmentation, I can recommend my paper: A Survey of Semantic Segmentation
In the application I am developing, I have about 5000 product label images (one label per product).
One feature of my application is that the user can take a picture with their camera and get possible matches against the product labels registered in the system.
Since my system initially has only one sample per product, I decided to go with traditional computer vision techniques. I implemented this using feature extraction and descriptor matching (OpenCV SIFT and FLANN, following this sample: https://github.com/kipr/opencv/blob/master/samples/cpp/matching_to_many_images.cpp).
Now I am wondering how to improve the accuracy by combining this with CNN or deep learning techniques, since each time a user approves a match, another label sample is added for that product.
Is it possible to build a hybrid image matching system combining Computer Vision techniques and CNN/Deep Learning techniques?
Are there any similar offerings already available as services?
You should learn more about Distance Metric Learning (DML). There is a lot of information on the internet, but briefly:
You must get an embedding (vector representation) for each image in your base (e.g. take the feature vector from the last convolutional layer of a modern CNN such as Inception, VGG, ResNet, or DenseNet).
Then, when you get a new image, create its vector representation and find the closest vector in your base (by Euclidean distance, for example).
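The lookup part of those two steps can be sketched like this. The embeddings are invented 3-dimensional vectors (a CNN would give you hundreds or thousands of dimensions), the label IDs and the `top_matches` helper are hypothetical, and normalising the vectors before comparing them is my own choice rather than something DML requires.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def top_matches(query, base, k=2):
    """Return the k product IDs whose embeddings are closest to the query,
    together with their Euclidean distances (rounded for display)."""
    q = l2_normalize(query)
    scored = []
    for product_id, embedding in base.items():
        e = l2_normalize(embedding)
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, e)))
        scored.append((dist, product_id))
    scored.sort()
    return [(product_id, round(dist, 3)) for dist, product_id in scored[:k]]

# Hypothetical base: one embedding per registered product label.
base = {
    "label_A": [1.0, 0.0, 0.0],
    "label_B": [0.0, 1.0, 0.0],
    "label_C": [0.7, 0.7, 0.0],
}

matches = top_matches([0.9, 0.1, 0.0], base, k=2)
print(matches)
```

Returning the k closest labels instead of only the best one fits the "possible match(es)" flow from the question: the user picks the right one, and the approved photo can be added to the base as one more embedding for that product.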
This topic is quite complicated, so study it carefully :)
Good luck!
I'm trying to implement a face recognition algorithm using Python. I want to be able to receive a directory of images and compute pair-wise distances between them, where short distances should hopefully correspond to images belonging to the same person. The ultimate goal is to cluster images and perform some basic face identification tasks (unsupervised learning).
Because of the unsupervised setting, my approach to the problem is to calculate a "face signature" (a vector in R^d for some int d) and then figure out a metric in which two faces belonging to the same person will indeed have a short distance between them.
I have a face detection algorithm which detects the face, crops the image and performs some basic pre-processing, so the images I'm feeding to the algorithm are gray and equalized (see below).
For the "face signature" part, I've tried two approaches which I read about in several publications:
Taking the histogram of the LBP (Local Binary Pattern) of the entire (processed) image
Calculating SIFT descriptors at 7 facial landmark points (right of mouth, left of mouth, etc.), which I identify per image using an external application. The signature is the concatenation of the square root of the descriptors (this results in a much higher dimension, but for now performance is not a problem).
For the comparison of two signatures, I'm using OpenCV's compareHist function (see here), trying out several different distance metrics (Chi Square, Euclidean, etc).
I know that face recognition is a hard task, let alone without any training, so I'm not expecting great results. But all I'm getting so far seems completely random. For example, when calculating distances from the image on the far right against the rest of the images, I find that she is most similar to 4 Bill Clintons (...!).
I have read in this great presentation that it's popular to carry out a "metric learning" procedure on a test set, which should significantly improve results. However it does say in the presentation and elsewhere that "regular" distance measures should also get OK results, so before I try this out I want to understand why what I'm doing gets me nothing.
In conclusion, my questions, which I'd love to get any sort of help on:
One improvement I thought of would be to perform LBP only on the actual face, and not the corners and everything else that might add noise to the signature. How can I mask out the parts which are not the face before calculating LBP? I'm using OpenCV for this part too.
I'm fairly new to computer vision; how would I go about "debugging" my algorithm to figure out where things go wrong? Is this possible?
In the unsupervised setting, is there any other approach (which is not local descriptors + computing distances) that could work, for the task of clustering faces?
Is there anything else in the OpenCV module that maybe I haven't thought of that might be helpful? It seems like all the algorithms there require training and are not useful in my case - the algorithm needs to work on images which are completely new.
Thanks in advance.
What you are looking for is unsupervised feature extraction - take a bunch of unlabeled images and find the most important features describing these images.
The state-of-the-art methods for unsupervised feature extraction are all based on (convolutional) neural networks. Have a look at autoencoders (http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity) or Restricted Boltzmann Machines (RBMs).
You could also take an existing face recognition model such as DeepFace (https://www.cs.toronto.edu/~ranzato/publications/taigman_cvpr14.pdf), take only the feature layers and use the distance between their outputs to group similar faces together.
I'm afraid that OpenCV is not well suited for this task; you might want to check out Caffe, Theano, TensorFlow or Keras.
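Once you have feature vectors from such a network, the "group similar faces together" part can be as simple as the sketch below. It is only one possible grouping rule (faces end up in the same group if a chain of sufficiently close pairs connects them); the 2-dimensional vectors, the `group_by_distance` helper and the threshold are placeholders for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def group_by_distance(features, threshold):
    """Group indices so that any two items closer than `threshold`
    (directly or through a chain of close pairs) share a group."""
    parent = list(range(len(features)))

    def find(i):
        # Follow parent links to the group representative, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if euclidean(features[i], features[j]) < threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(features)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Placeholder "face signatures": two tight pairs and one outlier.
features = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [9.0, 0.0]]
groups = group_by_distance(features, threshold=1.0)
print(groups)
```

This needs no labels and no fixed number of clusters, which matches the unsupervised setting in the question, but the result depends heavily on the threshold and on how good the feature vectors are.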
I have a serious problem right now. I need to compare images of flowers (carnations) using a genetic algorithm; the program must determine which variety a flower belongs to (so far I am using 15 different varieties). I am having difficulty constructing the chromosome. Right now I only analyse the HSV values of each image: I take every channel and calculate its mean (n=255), and after that I calculate the correlations between H and S, H and V, and S and V. I expected the means to be enough to place any new flower next to the cluster of flowers of its variety (by the way, I have a database of all the flowers used for training) by calculating the distance between the flower's mean and the centroid of each cluster, probably using the correlations for adjustment. However, the distance to a different variety is usually much smaller than the distance to the correct one. Is there a way to classify these flowers using ONLY colours (I've read of applications that use texture, but that's way out of my league), especially using a genetic algorithm (I know neural networks are more appropriate for this kind of analysis, but that's what the teacher asked for)? Thank you very much. By the way, I am working with OpenCV; I don't know if that's relevant. PS: Excuse any mistakes in my English; it is not my native language.