I want to find images similar to another image. After researching, I found two methods. The first was to represent the image by its attributes, like
length = full
pattern = check
color = blue
The limitation of this method is that I will not be able to get an exhaustive dataset with all the features marked.
The second approach I found was to extract features and do feature mapping.
So I decided to use deep convolutional neural networks with Caffe, so that by using any of the existing models I could learn the features and then perform feature matching or some other operation. I just wanted some general advice: what other methods are good and worth a try? And since I am just starting out with Caffe, can anyone give a general guideline on how to approach the problem with Caffe?
Thanks in advance
I looked at pHash; I was just curious whether it will only find images that are essentially the same (with minor intensity variations and other small changes), or whether it will also find images of the same semantic type. For example, for a t-shirt with blue and red stripes, would it give a black-and-white striped one as similar, and would it consider things like the length of the shirt, collar style, etc.?
It's true that it has been empirically shown that the Euclidean distance between features extracted using ConvNets is smaller for images of the same class and larger for images of different classes - but it's important to understand what kind of similarity you're looking for.
One can define many types of similarity measures, and the type of features you use (in the case of ConvNets, the type of data the network was trained on) affects the kind of similar images you'll get. For instance, given an image of a dog, maybe you want to find other pictures of dogs, but not that exact dog; alternatively, maybe you have a picture of a church and you want to find another image of the exact same church but from a different angle. These are two very different problems, with different methods you can use to solve them.
One particular kind of convolutional neural network you can look at is the Siamese network, which is built to learn similarities between two images, given a dataset of image pairs labeled same/not_same. You can look for a Caffe implementation of this method here.
A different method is to take a ConvNet trained on ImageNet data (see here for options), use the Python/MATLAB interface to classify images, and extract the second-to-last layer as the representation of each image. You can then take the Euclidean distance between those representations as your similarity measure.
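For illustration, here is a rough sketch of that idea with Caffe's Python interface. The prototxt/caffemodel file names, the 227x227 input size, and the choice of 'fc7' as the second-to-last layer are assumptions for a CaffeNet-style model, so adapt them to whatever pretrained model you download:

```python
import numpy as np
import caffe

caffe.set_mode_cpu()

# Placeholder paths: your deploy prototxt and pretrained ImageNet weights
# (e.g. the BVLC reference CaffeNet from the Model Zoo).
net = caffe.Net('deploy.prototxt', 'bvlc_reference_caffenet.caffemodel', caffe.TEST)
net.blobs['data'].reshape(1, 3, 227, 227)          # single-image batches

# Standard CaffeNet-style preprocessing: HWC->CHW, RGB->BGR, [0,1] -> [0,255]
# (mean subtraction omitted for brevity)
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_channel_swap('data', (2, 1, 0))
transformer.set_raw_scale('data', 255)

def embed(image_path, layer='fc7'):
    """Return the second-to-last layer activations as the image representation."""
    img = caffe.io.load_image(image_path)
    net.blobs['data'].data[...] = transformer.preprocess('data', img)
    net.forward()
    return net.blobs[layer].data[0].copy()

# Euclidean distance between two embeddings as the similarity measure
a, b = embed('shirt_a.jpg'), embed('shirt_b.jpg')   # placeholder file names
print(np.linalg.norm(a - b))                        # smaller distance = more similar
```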
Unrelated to Caffe, you can also use "old school" feature-matching methods included in open-source libraries like OpenCV (an example tutorial of such a method).
I don't mean that a neural network can do the work of a traditional image-processing algorithm. What I want to ask is whether there exists a kind of neural network that can take the parameters of the traditional method as input and output more universal parameters that don't require manual adjustment. Intuitively, my idea is less efficient than using neural networks directly, but I don't know much about the mathematics of neural networks.
If I understood correctly, what you mean is that for a traditional method (let's say thresholding), you want to find the best parameters using an ANN. It is possible, but you have to supply a lot of training data, which needs to be created, processed, and evaluated, so it will take a lot of time. AFAIK many mobile phones that have an AI-assisted camera use this method to find the best aperture, exposure, etc.
First of all, thank you very much. I still have two things to figure out. First, if I wanted to get a (or a set of) relatively optimal parameters, what dataset would I need to build (such as some kind of error between input and output, and a threshold)? Second, taking your example, is selecting the optimal threshold with a neural network more efficient or better in practice than traversal or Otsu? To be honest, I wonder if this is really more efficient than training directly on input and output with a neural network.
For your second question: Otsu only works in cases where the histogram has two distinct peaks. Thresholding is a simple function, but the cut-off value depends on your objective; there is no single "best" value valid for every case. So if you want to train a model for thresholding, I think you have to come up with separate models for each case (like a model for thresholding bright objects, another for darker ones, etc.). Maybe an additional output parameter for specifying the aim would work, but I am not sure. Will it be more efficient and better? It depends on the case (and your definition of "better"). Otsu, traversal, or adaptive thresholding does not work all the time (actually Otsu has very specific use cases). If they work for your case, excellent. If not, then things get messy. So to answer your question: it depends on the problem at hand.
For the first question, TBF, it is quite difficult to work with images in traditional ANNs. Images have a lot of pixels, so standard ANNs struggle with such inputs. Moreover, when the location/scale of an object in the image changes, the whole pixel data changes even though the content is the same (these are the reasons why CNNs are superior to plain ANNs for images). For these reasons it is better to use processed metrics which contain condensed, location-invariant information. E.g. for thresholding, you can give the network the histogram and it returns a threshold value. You would need an ANN with 256 input neurons (for the intensity histogram of an 8-bit grayscale image), 1 output neuron, and 1-2 hidden layers with some densely connected neurons (128, maybe?). Your training data will be a bunch of histograms as input and the corresponding best threshold value for each histogram. Once training is finished, you can give the ANN a histogram it has never seen before and it will tell you the optimal threshold value based on its training.
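To make that setup concrete, here is a minimal sketch in Keras (my choice of library, not something from the discussion above); the histograms and threshold labels are random placeholders standing in for a real training set:

```python
import numpy as np
from tensorflow import keras

# Placeholder data: each row is a normalized 256-bin intensity histogram,
# each target is the "best" threshold (0-255) chosen for that image.
X_train = np.random.rand(1000, 256)
y_train = np.random.randint(0, 256, size=1000).astype('float32')

# 256 inputs -> two small hidden layers -> 1 output (the threshold value)
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(256,)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)

# Inference on a histogram the network has never seen
new_hist = np.random.rand(1, 256)
print(model.predict(new_hist)[0, 0])
```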
What I want to do is a model that can output different parameters (parameter sets) based on different input images, so I think that if you choose a good enough dataset it should be somewhat universal.
Most likely, but your dataset should be quite inclusive of the expected images (in terms of metrics and features), which means it has to be large.
Also, I don't know much about modeling -- can I use a function of the output/parameters (which might be a function of the result of the traditional method) as the error in back-propagation by creating a custom loss function?
I think so, but training the model will be more involved than using predefined loss functions because, well, you have to write them. You also have to test that they work as expected.
I want to build a face detector/classifier to generate a network that detects whether a face is present in an image/video.
I understand the basic concept, but what I have problems with is the choice of the number of classes.
Initially, I thought that two classes (with face / without face) would be sufficient. However, I was unsure which data I should use for the 'without face' class. So I threw together datasets of equipment, plants, and animals, whereupon the classes became very unbalanced, which is apparently not good.
Then I thought it would be better to use as many classes as possible.
But again, I am unsure what the best/common approach to this problem would be.
You can experiment with any number of samples and different images for the negative class. If the datasets with equipment/plant/places you have are imbalanced, you can try to subsample, e.g. pick 100 images from each.
Just don't make the negative class too huge with respect to the number of images with human samples you have. The rest is up to experimentation.
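A minimal sketch of that subsampling, assuming a hypothetical layout with one folder per negative category and a 'faces' folder for the positives:

```python
import os
import random

# Hypothetical directory names for the negative categories
negative_dirs = ['negatives/equipment', 'negatives/plants', 'negatives/animals']

negatives = []
for d in negative_dirs:
    files = [os.path.join(d, f) for f in os.listdir(d)]
    negatives += random.sample(files, min(100, len(files)))   # cap at 100 per category

positives = [os.path.join('faces', f) for f in os.listdir('faces')]
print(len(positives), 'positive /', len(negatives), 'negative samples')
```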
I want to train a neural network to extract a number of (128) face features from an image.
The features are numbers that measure things like the distance between the centers of the eyes, or the distance between the center of the left eye and the middle of the mouth.
I need this to find the dissimilarity between two faces: given a database of users, by analyzing a photo I'll be able to tell whether it's a photo of John.
I began my study using this link, which states: Researchers have discovered that the most accurate approach is to let the computer figure out the measurements to collect itself.
OK, so the output of the network is an array of 128 numbers, and I'll use some formula to adjust the weights so the output numbers are as accurate as possible.
What should I use as input? Will my input nodes be three photos, like in this article, and I'll extract the features based on the comparisons between the photos?
My first thought would be for you to use a library such as OpenFace, which is already trained on lots of faces and has a great face representation (with the same 128 dimensions you need).
However, you mentioned that you want to train it yourself, so I'd recommend you start by taking a look at Siamese Neural Networks. Siamese Neural Networks receive a pair of images (a genuine pair, e.g. images of the same person, or an impostor pair, e.g. images of different persons) and try to learn a similarity/dissimilarity metric (this is also called metric learning). This is very useful for learning face embeddings, which seems to be what your goal is related to. They basically learn a way to map the input images to a representation that "benefits comparison". Other implementations (such as OpenFace) are trained with triplet embeddings, where instead of a pair of images you use a triplet (two similar and one dissimilar).
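To make the pair-based training idea concrete, here is a rough Keras sketch of a Siamese setup with a contrastive loss; the tiny ConvNet, the 96x96 input size, and the margin value are illustrative assumptions, not the architectures used by OpenFace or FaceNet:

```python
import tensorflow as tf
from tensorflow import keras

def embedding_net():
    # Small placeholder ConvNet that maps a face image to 128 numbers
    return keras.Sequential([
        keras.layers.Conv2D(32, 3, activation='relu', input_shape=(96, 96, 3)),
        keras.layers.MaxPooling2D(),
        keras.layers.Conv2D(64, 3, activation='relu'),
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(128),
    ])

base = embedding_net()
img_a = keras.Input(shape=(96, 96, 3))
img_b = keras.Input(shape=(96, 96, 3))

# Euclidean distance between the two embeddings (shared weights via `base`)
dist = keras.layers.Lambda(
    lambda t: tf.sqrt(tf.reduce_sum(tf.square(t[0] - t[1]), axis=1, keepdims=True) + 1e-9)
)([base(img_a), base(img_b)])
siamese = keras.Model([img_a, img_b], dist)

def contrastive_loss(y_true, d, margin=1.0):
    # Pull genuine pairs (label 1) together, push impostor pairs (label 0)
    # at least `margin` apart.
    y_true = tf.cast(y_true, d.dtype)
    return tf.reduce_mean(y_true * tf.square(d) +
                          (1.0 - y_true) * tf.square(tf.maximum(margin - d, 0.0)))

siamese.compile(optimizer='adam', loss=contrastive_loss)
# siamese.fit([pairs_a, pairs_b], pair_labels, ...)  with arrays of image pairs
```

After training, `base` alone gives you the 128-dimensional embedding, and a distance threshold on pairs of embeddings decides same/not-same.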
Here are some references to start with Siamese Networks:
Signature recognition (a little old but good to understand about them): https://papers.nips.cc/paper/769-signature-verification-using-a-siamese-time-delay-neural-network.pdf
Siamese networks for face embeddings: http://yann.lecun.com/exdb/publis/pdf/chopra-05.pdf
Triplet Embeddings paper: https://arxiv.org/pdf/1503.03832.pdf
Just keep in mind that training these architectures is quite difficult, since selecting the best pairs is a very important and challenging part of the problem. One paper that mentions some of the challenges of creating image pairs (though it is not related to faces) is this one.
Hope that helps!
I have implemented the SIFT algorithm in OpenCV for feature detection and matching using the following steps (a rough code sketch of this pipeline is shown after the list):
Background Removal using Otsu's thresholding
Feature Detection using SIFT feature detector
Descriptor Extraction using SIFT feature extractor
Matching feature vectors using BFMatcher (L2 norm) and using the ratio test to filter good matches
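For reference, a minimal OpenCV sketch of the pipeline above (file names are placeholders; SIFT is in the main cv2 namespace in OpenCV >= 4.4, older builds need opencv-contrib):

```python
import cv2

img1 = cv2.imread('template.jpg', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('search.jpg', cv2.IMREAD_GRAYSCALE)

# Rough background removal with Otsu: nonzero mask pixels are kept for SIFT
_, mask1 = cv2.threshold(img1, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, mask1)
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matching with the L2 norm, keeping the 2 nearest neighbours
bf = cv2.BFMatcher(cv2.NORM_L2)
matches = bf.knnMatch(des1, des2, k=2)

# Lowe's ratio test to filter ambiguous matches
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), 'good matches')
```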
My objective is to classify images into different categories such as shoes, shirts etc. based on their similarity. For example two different heels should be more similar to each other than a heel and a sports shoe or a heel and a t-shirt.
However, this algorithm works well only when my template image is present in the search image (at any scale and orientation). If I compare two different heels, they don't match well and the matches are random (the heel of one image matches the flat surface of the other image). There are also many false positives when I compare a heel with a sports shoe, a heel with a t-shirt, or a heel with the picture of a baby!
I would like to look at a heel, identify it as a heel, and return how similar the heel is to different images in my database, giving maximum similarity for other heels, followed by other shoes. It should not report any similarity with irrelevant objects such as shirts, phones, or pens.
I understand that the SIFT algorithm produces a descriptor vector for each keypoint based on the gradient values of the pixels around the keypoint, and images are matched purely using this attribute. Hence it is quite possible that a keypoint located near the heel of one shoe is matched to a keypoint on the surface of the other shoe. Therefore, what I gather is that this algorithm can only be used to detect exact matches, not to measure the similarity between images.
Could you please tell me whether this algorithm can be used for my objective and whether I am doing something wrong, or suggest any other approach that I should use.
For classification of similar objects, I certainly would go for cascade classifiers.
Basically, a cascade classifier is a machine-learning method where you train a classifier to detect an object in different images. For it to work well, you need to train the classifier with a lot of positive images (where your object is present) and negative images (where it is not). The method was invented by Viola and Jones in 2001.
There is a ready-made implementation in OpenCV for face detection; you will find more explanation in the OpenCV documentation (sorry, can't post the link, I'm limited to 1 link for the moment).
Now, for the caveats :
First, you need a lot of positive and negative images. The more images you have, the better the algorithm will perform. Beware of overfitting: if your training dataset for heels contains, for instance, too many images of a given model, it is possible that others will not be detected properly.
Training the cascade classifier can be long and difficult. The end-result will depend on how well you choose the parameters for training the classifier. Some info on this can be found on this webpage : http://coding-robin.de/2013/07/22/train-your-own-opencv-haar-classifier.html
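As a quick illustration of how a trained cascade is used at detection time, here is a sketch using the ready-made face cascade that ships with OpenCV; for heels or shirts you would first train your own cascade XML and load that file instead:

```python
import cv2

# The face cascade bundled with opencv-python; swap in your own trained XML
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

img = cv2.imread('photo.jpg')                      # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Slide the cascade over the image at several scales
detections = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in detections:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite('detections.jpg', img)
```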
I have a large image (5400x3600) that has multiple CCTVs that I need to detect.
The detection takes a lot of time (4-7 minutes) with rotation, and it still fails to resolve certain CCTVs.
What is the best method to match a template like this?
I am using skimage - OpenCV is not an option for me, but I am open to suggestions on that too.
For example, in the images below, the template is correctly matched in the second image, but the first image is not matched - I guess due to the noise created by the text "BLDG...".
(Template, source image, and match-result screenshots omitted.)
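For reference, here is a brute-force rotation sweep with skimage's match_template, roughly the slow baseline described above (the file names and the 15-degree step are assumptions):

```python
import numpy as np
from skimage import io, transform
from skimage.feature import match_template

image = io.imread('floorplan.png', as_gray=True)        # placeholder file names
template = io.imread('cctv_template.png', as_gray=True)

best_score, best_xy, best_angle = -1.0, None, 0
for angle in range(0, 360, 15):                          # coarse rotation sweep
    rotated = transform.rotate(template, angle, resize=True)
    result = match_template(image, rotated)              # normalized cross-correlation
    ij = np.unravel_index(np.argmax(result), result.shape)
    if result[ij] > best_score:
        best_score, best_xy, best_angle = result[ij], ij[::-1], angle

print('best score %.2f at %s, angle %d' % (best_score, best_xy, best_angle))
```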
The fastest method is probably a cascade of boosted classifiers trained with several variations of your logo, possibly a few rotations, and some negative examples too (non-logos). You have to roughly scale your overall image so the test and training examples are approximately matched in scale. Unlike SIFT or SURF, which spend a lot of time searching for interest points and creating descriptors for both learning and searching, binary classifiers shift most of the burden to the training stage, so your testing or search will be much faster.
In short, the cascade runs in such a way that the very first test discards a large portion of the image. If the first test passes, the others follow and refine. They are super fast, consisting on average of just a few intensity comparisons around each point. Only a few locations will pass the whole cascade, and those can be verified with additional tests such as your rotation-correlation routine.
Thus, the classifiers are effective not only because they quickly detect your object but also because they quickly discard non-object areas. To read more about boosted classifiers, see the following OpenCV section.
This problem is generally addressed as logo detection. See this for a similar discussion.
There are many robust methods for template matching. See this or google for a very detailed discussion.
But from your example I can guess that the following approach would work.
Create a feature for your search image. It essentially has a rectangle enclosing the word "CCTV". So the width, height, angle, and individual character features for matching the textual information could be a suitable choice. (Or you may also use the image containing "CCTV"; in that case the method will not be scale invariant.)
Now, when searching, first detect rectangles. Then use the angle to prune your search space, and also use an image transformation to align the rectangles parallel to the axes (this should take care of the need for rotation). Then, according to the feature chosen in step 1, match the text content. If you use individual character features, your template-matching step is essentially a classification step. Otherwise, if you use an image for matching, you may use cv::matchTemplate.
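A rough OpenCV sketch of the rectangle-detection and alignment step (the file name, area threshold, and use of minAreaRect are my assumptions; the code uses OpenCV 4's two-value findContours return):

```python
import cv2

img = cv2.imread('plan.png', cv2.IMREAD_GRAYSCALE)       # placeholder file name
_, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(bw, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

for c in contours:
    approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
    if len(approx) != 4 or cv2.contourArea(approx) < 200:
        continue                                          # keep only rectangle-like shapes
    (cx, cy), (w, h), angle = cv2.minAreaRect(approx)
    M = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)     # rotate the patch upright
    upright = cv2.warpAffine(img, M, img.shape[::-1])
    patch = cv2.getRectSubPix(upright, (int(w), int(h)), (cx, cy))
    # ...then compare `patch` against the "CCTV" template with cv2.matchTemplate
```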
Hope it helps.
Symbol spotting is more complicated than logo spotting because interest points hardly work on document images such as architectural plans. Many conferences deal with pattern recognition, and each year there are many new algorithms for symbol spotting, so giving you the single best method is not possible. You could check the IAPR conferences: ICPR, ICDAR, DAS, GREC (Workshop on Graphics Recognition), etc. These researchers focus on this topic: M. Rusiñol, J. Lladós, S. Tabbone, J.-Y. Ramel, M. Liwicki, etc. They work on several techniques for improving symbol spotting, such as vectorial signatures, graph-based signatures, and so on (check Google Scholar for more papers).
An easy way to start a new approach is to work with simple shapes such as lines, rectangles, and triangles instead of matching everything at once.
Your example can be recognized by shape matching (contour matching), which is much faster than 4 minutes.
For a good match, you need nice preprocessing and denoising.
Examples can be found at http://www.halcon.com/applications/application.pl?name=shapematch
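The Halcon page above is commercial; as a rough open-source counterpart, here is a contour-matching sketch with OpenCV's matchShapes (file names are placeholders, and the code assumes OpenCV 4's two-value findContours return):

```python
import cv2

def largest_contour(path):
    # Light denoising, Otsu binarization, then keep the biggest outer contour
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.GaussianBlur(img, (5, 5), 0)
    _, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return max(contours, key=cv2.contourArea)

c1 = largest_contour('template.png')
c2 = largest_contour('candidate.png')

# matchShapes compares Hu-moment signatures (rotation/scale invariant);
# a lower score means more similar shapes
print(cv2.matchShapes(c1, c2, cv2.CONTOURS_MATCH_I1, 0.0))
```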