Image Similarity - Deep Learning vs hand-crafted features - image-processing

I am doing research in the field of computer vision, and am working on a problem related to finding visually similar images to a query image. For example, finding t-shirts of similar colour with similar patterns (Striped/ Checkered), or shoes of similar colour and shape, and so on.
I have explored hand-crafted image features such as Color Histograms, Texture features, Shape features (Histogram of Oriented Gradients), SIFT and so on. I have also read up literature about Deep Neural Networks (Convolutional Neural Networks), which have been trained on massive amounts of data and are currently state of the art in Image Classification.
I was wondering if the same features (extracted from the CNN's) can also be used for my project - finding fine-grained similarities between images. From what I understand, the CNNs have learnt good representative features that can help classify images - for example, be it a red shirt or a blue shirt or an orange shirt, it is able to identify that the image is a shirt. However it doesn't understand that an orange shirt looks more similar to a red shirt than a blue shirt does, and hence it is not able to capture these similarities.
Please correct me if I am wrong. I would like to know if there are any Deep Neural Networks that capture these similarities, and have proven to be superior to the hand-crafted features. Thanks in advance.

For your task, a CNN is definitely worth a try!
Many researchers used networks which are pretrained for Image Classification and obtained state-of-the-art results on fine-grained classification. For example, trying to classify birds species or cars.
Now, your task is not classification, but it is related. You can think about similarity as some geometric distance between features, which are basically vectors. Thus, you may carry out some experiments computing the distance between the feature vectors for all your training images (the reference) and the feature vector extracted from the query image.
CNNs features extracted from the first layers of the net should be more related to color or other graphical traits, rather than more "semantical" ones.
Alternatively, there is some work on learning directly a similarity metric through CNN, see here for example.

A little bit out-dated, but it can still be useful for other people. Yes, CNNs can be used for image similarity and I used before. As Flavio pointed out, for a simple start, you can use a pre-trained CNN of your choice such as Alexnet,GoogleNet etc.. and then use it as feature extractor. You can compare the features based on the distance, similar pictures will have a smaller distance between their feature vectors.

Related

Equivalent of NMI and B3 for multilabel extrinsic clustering evaluation metrics

Normalized Mutual Information (NMI) and B3 are used for extrinsic clustering evaluation metrics when each instance (sample) has only one label.
What are equivalent metrics when each instance (sample) has only one label?
For example, in first image, we see [apple, orange, pears], in second image, we see [orange, lime, lemon] and in third image, we see [apple], and in the forth image we see [orange]. Then, if put first image and last image in the one cluster it is good, and if put third and forth image in one cluster is bad.
Application: Many popular datasets for object detection or image segmentation have multi labels for each image. If we used this data for classification (not detection and not segmentation), we have multiple labels for each image.
Note: My task is unsupervised clustering, not supervised classification. I know that for supervised classification, we can use top-5 or top-10 score. But I do not know what will be in unsupervised clustering.
If the multi-labels are still sparse, then you can use Element-centric similarity, Omega Index, or Overlapping NMI (I dont recommend the last, it has severe biases). All three are implemented in the python package, CluSim.
If the multi-labels are dense, then you are into fuzzy clustering comparison. There are several divergence measures for membership functions, including L1-norm, Euclidean distance, KL-divergence, but I'm not aware of a literature arguing for one method over another.

Why rotation-invariant neural networks are not used in winners of the popular competitions?

As known, modern most popular CNN (convolutional neural network): VGG/ResNet (FasterRCNN), SSD, Yolo, Yolo v2, DenseBox, DetectNet - are not rotate invariant: Are modern CNN (convolutional neural network) as DetectNet rotate invariant?
Also known, that there are several neural networks with rotate-invariance object detection:
Rotation-Invariant Neoperceptron 2006 (PDF): https://www.researchgate.net/publication/224649475_Rotation-Invariant_Neoperceptron
Learning rotation invariant convolutional filters for texture classification 2016 (PDF): https://arxiv.org/abs/1604.06720
RIFD-CNN: Rotation-Invariant and Fisher Discriminative Convolutional Neural Networks for Object Detection 2016 (PDF): http://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Cheng_RIFD-CNN_Rotation-Invariant_and_CVPR_2016_paper.html
Encoded Invariance in Convolutional Neural Networks 2014 (PDF)
Rotation-invariant convolutional neural networks for galaxy morphology prediction (PDF): https://arxiv.org/abs/1503.07077
Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images 2016: http://ieeexplore.ieee.org/document/7560644/
We know, that in such image-detection competitions as: IMAGE-NET, MSCOCO, PASCAL VOC - used networks ensembles (simultaneously some neural networks). Or networks ensembles in single net such as ResNet (Residual Networks Behave Like Ensembles of Relatively Shallow Networks)
But are used rotation invariant network ensembles in winners like as MSRA, and if not, then why? Why in ensemble the additional rotation-invariant network does not add accuracy to detect certain objects such as aircraft objects - which images is done at a different angles of rotation?
It can be:
aircraft objects which are photographed from the ground
or ground objects which are photographed from the air
Why rotation-invariant neural networks are not used in winners of the popular object-detection competitions?
The recent progress in image recognition which was mainly made by changing the approach from a classic feature selection - shallow learning algorithm to no feature selection - deep learning algorithm wasn't only caused by mathematical properties of convolutional neural networks. Yes - of course their ability to capture the same information using smaller number of parameters was partially caused by their shift invariance property but the recent research has shown that this is not a key in understanding their success.
In my opinion the main reason behind this success was developing faster learning algorithms than more mathematically accurate ones and that's why less attention is put on developing another property invariant neural nets.
Of course - rotation invariance is not skipped at all. This is partially made by data augmentation where you put the slightly changed (e.g. rotated or rescaled) image to your dataset - with the same label. As we can read in this fantastic book these two approaches (more structure vs less structure + data augmentation) are more or less equivalent. (Chapter 5.5.3, titled: Invariances)
I'm also wondering why the community or scholar didn't put much attention on ration invariant CNN as #Alex.
One possible cause, in my opinion, is that many scenarios don't need this property, especially for those popular competitions. Like Rob mentioned, some natural pictures are already taken in a unified horizontal (or vertical) way. For example, in face detection, many works will align the picture to ensure the people are standing on the earth before feeding to any CNN models. To be honest, this is the most cheap and efficient way for this particular task.
However, there does exist some scenarios in real life, needing rotation invariant property. So I come to another guess: this problem is not difficult from those experts (or researchers)' view. At least we can use data augmentation to obtain some rotate invariant.
Lastly, thanks so much for your summarization about the papers. I added one more paper Group Equivariant Convolutional Networks_icml2016_GCNN and its implementation on github by other people.
Object detection is mostly driven by the successes of detection algorithms in world-famous object detection benchmarks like PASCAL-VOC and MS-COCO, which are object centric datasets where most objects are vertical (potted plants, humans, horses, etc.) and thus data augmentation with left-right flips is often sufficient (for all we know data augmentation with rotated images like upside-down flips could even hurt detection performance).
Every year the entire community adopts the base algorithmic structure of the winning solution and build on it (I am exaggerating a bit to prove a point but not so much).
Interestingly other less widely known topics like oriented text detections and oriented vehicle detections in aerial imagery both need rotation invariant features and rotation equivariant detection pipelines (like in both articles from Cheng you mentioned).
If you want to find literature and code in this area you need to dive in these two domains. I can already give you a few pointers like the DOTA challenge for aerial imagery or the ICDAR challenges for oriented text detections.
As #Marcin Mozejko said, CNN are by nature translation invariant and not rotation invariant. It is an open problem how to incorporate perfect rotation invariance the few articles that deal with it have yet to become standards even though some of them seem promising.
My personal favorite for detection is the modification of Faster R-CNN recently proposed by Ma.
I hope that this direction of research will be investigated more and more once people will get fed up of MS-COCO and VOC.
What you could try is take a state-of-the-art detector trained on MS-COCO like Faster R-CNN with NASNet from TF detection API and see how it performs wrt rotating the test image, in my opinion it would be far from rotation invariant.
Rotation invariance is mostly a good thing, but not always. Objects can have different interpretation based on their rotation, eg. if a rotated "1" might be difficult to distinguish from a "7".
First, let's acknowledge that introducing rotational invariance requires a static assumption about the distribution of angles. For example, another commenter on this page suggested rotating the kernel with 30-degree steps. That's equivalent to assuming that useful rotations in each layer are uniformly distributed over the rotation angles.
In contrast to that, when the network learns rotated kernels, the network picks a different distribution of angles for each layer. An interesting research question is to find what distribution of rotation angles is implied by learned kernels. In any case, why would such learning flexibility be useful?
I suspect that the assumption of a uniform distribution might not be equally useful across all layers of a network. In the first few convolutional layers (edges and other basic shapes), it's likely true that the rotation angles are uniformly distributed. However, in the deep layers, this assumption might be less valid. If cars are almost always rotated within a small range of angles, then why waste compute and space on unlikely rotations?
However, the network won't learn the right distribution of angles if the training dataset is not sufficiently representative. Note that simply rotating an image (called data augmentation) is not the same as rotating an object relative to other objects in the same image. I suppose it comes down to your expectation of the difference between the training dataset and the unobserved dataset to which the network has to generalize.
Interestingly, the human visual cortex is not fully rotation-invariant at the scale of major face features. See https://en.wikipedia.org/wiki/Thatcher_effect.

Keypoint recognition as classification?

At the end of the introduction to this instructive kaggle competition, they state that the methods used in "Viola and Jones' seminal paper works quite well". However, that paper describes a system for binary facial recognition, and the problem being addressed is the classification of keypoints, not entire images. I am having a hard time figuring out how, exactly, I would go about adjusting the Viola/Jones system for keypoint recognition.
I assume I should train a separate classifier for each keypoint, and some ideas I have are:
iterate over sub-images of a fixed size and classify each one, where an image with a keypoint as center pixel is a positive example. In this case I'm not sure what I would do with pixels close to the edge of the image.
instead of training binary classifiers, train classifiers with l*w possible classes (one for each pixel). The big problem with this is that I suspect it will be prohibitively slow, as every weak classifier suddenly has to do l*w*original operations
the third idea I have isn't totally hashed out in my mind, but since the keypoints are each parts of a greater part of a face (left, right center of an eye, for example), maybe I could try to classify sub-images as just an eye, and then use the left, right, and center pixels (centered in the y coordinate) of the best-fit subimage for each face-part
Is there any merit to these ideas, and are there methods I haven't thought of?
however, that paper describes a system for binary facial recognition
No, read the paper carefully. What they describe is not face specific, face detection was the motivating problem. The Viola Jones paper introduced a new strategy for binary object recognition.
You could train a Viola Jones style Cascade for eyes, another for a nose, and one for each keypoint you are interested in.
Then, when you run the code - you should (hopefully) get 2 eyes, 1 nose, etc, for each face.
Provided you get the number of items you expected, you can then say "here are the key points!" What takes more work is getting enough data to build a good detector for each thing you want to detect, and gracefully handling false positives / negatives.
I ended up working on this problem extensively. I used "deep learning," aka several layers of neural networks. I used convolutional networks. You can learn more about them by checking out these demos:
http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html
http://deeplearning.net/tutorial/lenet.html#lenet
I made the following changes to a typical convolutional network:
I did not do any down-sampling, as any loss of precision directly translates to a decrease in the model's score
I did n-way binary classification, with each pixel being classified as a keypoint or non-keypoint (#2 in the things I listed in my original post). As I suspected, computational complexity was the primary barrier here. I tried to use my GPU to overcome these issues, but the number of parameters in the neural network were too large to fit in GPU memory, so I ended up using an xl amazon instance for training.
Here's a github repo with some of the work I did:
https://github.com/cowpig/deep_keypoints
Anyway, given that deep learning has blown up in popularity, there are surely people who have done this much better than I did, and published papers about it. Here's a write-up that looks pretty good:
http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/

How to create a single constant-length feature vector from a variable number of image descriptors (SURF)

My problem is as follows:
I have 6 types of images, or 6 classes. For example, cat, dog, bird, etc.
For every type of image, I have many variations of that image. For example, brown cat, black dog, etc.
I'm currently using a Support Vector Machine (SVM) to classify the images using one-versus-rest classification. I'm unfolding each image into a single pixel vector and using that as the feature vector for a given image I'm experiencing decent classification accuracy, but I want to try something different.
I want to use image descriptors, particularly SURF features, as the feature vector for each image. This issue is, I can only have a single feature vector per given image and I'm given a variable number of SURF features from the feature extraction process. For example, 1 picture of a cat may give me 40 SURF features, while 1 picture of a dog will give me 68 SURF features. I could pick the n strongest features, but I have no way of guaranteeing that the chosen SURF features are ones that describe my image (for example, it could focus on the background). There's also no guarantee that ANY SURF features are found.
So, my problem is, how can I get many observations (each being a SURF feature vector), and "fold" these observations into a single feature vector which describes the raw image and can fed to an SVM for training?
Thanks for your help!
Typically the SURF descriptors are quantized using a K-means dictionary and aggregated into one l1-normalized histogram. So your inputs to the SVM algorithm are now fixed in size.

Face Recognition Logic

I want to develop an application in which user input an image (of a person), a system should be able to identify face from an image of a person. System also works if there are more than one persons in an image.
I need a logic, I dont have any idea how can work on image pixel data in such a manner that it identifies person faces.
Eigenface might be a good algorithm to start with if you're looking to build a system for educational purposes, since it's relatively simple and serves as the starting point for a lot of other algorithms in the field. Basically what you do is take a bunch of face images (training data), switch them to grayscale if they're RGB, resize them so that every image has the same dimensions, make the images into vectors by stacking the columns of the images (which are now 2D matrices) on top of each other, compute the mean of every pixel value in all the images, and subtract that value from every entry in the matrix so that the component vectors won't be affine. Once that's done, you compute the covariance matrix of the result, solve for its eigenvalues and eigenvectors, and find the principal components. These components will serve as the basis for a vector space, and together describe the most significant ways in which face images differ from one another.
Once you've done that, you can compute a similarity score for a new face image by converting it into a face vector, projecting into the new vector space, and computing the linear distance between it and other projected face vectors.
If you decide to go this route, be careful to choose face images that were taken under an appropriate range of lighting conditions and pose angles. Those two factors play a huge role in how well your system will perform when presented with new faces. If the training gallery doesn't account for the properties of a probe image, you're going to get nonsense results. (I once trained an eigenface system on random pictures pulled down from the internet, and it gave me Bill Clinton as the strongest match for a picture of Elizabeth II, even though there was another picture of the Queen in the gallery. They both had white hair, were facing in the same direction, and were photographed under similar lighting conditions, and that was good enough for the computer.)
If you want to pull faces from multiple people in the same image, you're going to need a full system to detect faces, pull them into separate files, and preprocess them so that they're comparable with other faces drawn from other pictures. Those are all huge subjects in their own right. I've seen some good work done by people using skin color and texture-based methods to cut out image components that aren't faces, but these are also highly subject to variations in training data. Color casting is particularly hard to control, which is why grayscale conversion and/or wavelet representations of images are popular.
Machine learning is the keystone of many important processes in an FR system, so I can't stress the importance of good training data enough. There are a bunch of learning algorithms out there, but the most important one in my view is the naive Bayes classifier; the other methods converge on Bayes as the size of the training dataset increases, so you only need to get fancy if you plan to work with smaller datasets. Just remember that the quality of your training data will make or break the system as a whole, and as long as it's solid, you can pick whatever trees you like from the forest of algorithms that have been written to support the enterprise.
EDIT: A good sanity check for your training data is to compute average faces for your probe and gallery images. (This is exactly what it sounds like; after controlling for image size, take the sum of the RGB channels for every image and divide each pixel by the number of images.) The better your preprocessing, the more human the average faces will look. If the two average faces look like different people -- different gender, ethnicity, hair color, whatever -- that's a warning sign that your training data may not be appropriate for what you have in mind.
Have a look at the Face Recognition Hompage - there are algorithms, papers, and even some source code.
There are many many different alghorithms out there. Basically what you are looking for is "computer vision". We had made a project in university based around facial recognition and detection. What you need to do is google extensively and try to understand all this stuff. There is a bit of mathematics involved so be prepared. First go to wikipedia. Then you will want to search for pdf publications of specific algorithms.
You can go a hard way - write an implementaion of all alghorithms by yourself. Or easy way - use some computer vision library like OpenCV or OpenVIDIA.
And actually it is not that hard to make something that will work. So be brave. A lot harder is to make a software that will work under different and constantly varying conditions. And that is where google won't help you. But I suppose you don't want to go that deep.

Resources