Why are KeyPoint "detector" and "extractor" different operations? - opencv

Basically you first have to do:
SurfFeatureDetector surf(400);
surf.detect(image1, keypoints1);
and then:
surfDesc.compute(image1, keypoints1, descriptors1);
Why are detect and compute two separate operations? Doesn't computing after detecting introduce redundant loops?
I found that .compute is the most expensive step in my application: .detect runs in about 0.2 s, while .compute takes about 1 s. Is there any way to speed up .compute?

The detection of keypoints is just the process of selecting points in the image that are considered "good features".
The extraction of descriptors for those keypoints is a completely different process: it encodes properties of the feature, such as contrast with its neighbours, so that it can be compared with keypoints from other images taken at a different scale and orientation.
The way you describe a keypoint can be crucial for successful matching, and this is really the key factor. It also determines the matching speed: for example, you can describe it as a vector of floats or as a binary sequence.

There is a difference between detecting the keypoints in an image and computing the descriptors for those keypoints. For example, you can extract SURF keypoints and compute SIFT descriptors for them. Note that in the DescriptorExtractor::compute method, filters are applied to the keypoints:
KeyPointsFilter::runByImageBorder()
KeyPointsFilter::runByKeypointSize()
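To make the split concrete, here is a minimal Python sketch of mixing algorithms (the file name is a placeholder, and FAST/ORB stand in for SURF, which lives in the non-free xfeatures2d contrib module):

import cv2

img = cv2.imread("image1.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints with one algorithm...
detector = cv2.FastFeatureDetector_create(threshold=40)
keypoints = detector.detect(img, None)

# ...and compute descriptors for those keypoints with another.
extractor = cv2.ORB_create()
keypoints, descriptors = extractor.compute(img, keypoints)

# compute() may drop some keypoints (e.g. those too close to the image border).
print(len(keypoints), descriptors.shape)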

Picking up from where Jay_Rock left off, you can improve those processing times by using a binary descriptor offered by algorithms like ORB, BRISK or FREAK. Not only do they occupy 32 bytes instead of 64 floats, they also offer different methods for computing descriptors that are as robust as SURF's and much faster.
If you eventually want to perform matching operations between descriptors, this is done by calculating the Hamming distance between them. Given that this is just an XOR between two binary strings followed by a bit count, it takes only a few milliseconds to run.
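A minimal Python sketch of what that looks like with ORB and a Hamming-distance brute-force matcher (file names are placeholders):

import cv2

img1 = cv2.imread("image1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("image2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors are matched with the Hamming distance (XOR + bit count).
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(len(matches), "matches, best distance:", matches[0].distance)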

Related

What is the role of cv2.face.LBPHFaceRecognizer_create()?

I know that cv2.face.LBPHFaceRecognizer_create() is used to recognize faces in real time, but I want to know what its function is, what happens inside this instruction, and how it works.
For example, I want to know its structure: does it take the image, extract characteristics in LBPH form, then train on the images and compare them so that it can recognize them?
Please share with me any information or document that could help.
LBPs (Local Binary Patterns) are one way to extract characteristic features of an object (which could be a face, a coffee cup or anything else that has a visual representation). The LBP algorithm is really straightforward and can even be done manually (pixel thresholding plus pixel-level arithmetic operations).
LBP algorithm: each pixel is compared with its 8 neighbours; a neighbour that is greater than or equal to the centre pixel contributes a 1, the others contribute a 0, and the resulting 8-bit pattern is that pixel's LBP code.
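As a rough illustration only (not OpenCV's internal code), the basic 3x3 operator can be written in a few lines of Python:

import numpy as np

def lbp_code(patch):
    # LBP code of the centre pixel of a 3x3 patch.
    centre = patch[1, 1]
    # Neighbours taken clockwise, starting at the top-left corner.
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                  patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    bits = [1 if n >= centre else 0 for n in neighbours]
    return sum(bit << i for i, bit in enumerate(bits))

patch = np.array([[ 90, 120,  60],
                  [100, 100,  80],
                  [130,  95, 101]])
print(lbp_code(patch))  # an 8-bit code in [0, 255]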
There is a "training" step in OpenCV's FaceRecognizer methods. Don't let this confuse you: there is no deep learning approach here, just simple math.
OpenCV converts the LBP image into histograms in order to store the spatial information of the face, using the representation proposed by Timo Ahonen, Abdenour Hadid, and Matti Pietikäinen, "Face recognition with local binary patterns", in Computer Vision - ECCV 2004, pages 469-481, Springer, 2004. It divides the LBP image into local regions of size m, extracts a histogram for each region, and concatenates them.
After having this information about one person's (one label's) face, the rest is simple. During inference, it computes the test face's LBP image, divides it into regions, and builds a histogram. It then compares that histogram's Euclidean distance to the trained faces' histograms; if the distance is less than the tolerance value, it counts as a match. (Other distance measures can also be used, such as the chi-squared distance or the absolute value.)
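Putting it together, a minimal usage sketch of the recognizer itself (requires the opencv-contrib-python package; file names and labels are hypothetical):

import cv2
import numpy as np

# Grayscale face crops and an integer label per person (hypothetical files).
faces = [cv2.imread(p, cv2.IMREAD_GRAYSCALE)
         for p in ["person0_a.png", "person0_b.png", "person1_a.png"]]
labels = np.array([0, 0, 1])

# radius/neighbors control the LBP operator, grid_x/grid_y the local regions.
recognizer = cv2.face.LBPHFaceRecognizer_create(radius=1, neighbors=8,
                                                grid_x=8, grid_y=8)
recognizer.train(faces, labels)

test = cv2.imread("unknown_face.png", cv2.IMREAD_GRAYSCALE)
label, distance = recognizer.predict(test)  # lower distance = closer match
print(label, distance)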

SIFT features and classification of images?

I am new to image processing, and I want to extract image features in order to do some classification. I am having problems understanding the pipeline.
As far as I understand, I have images and I run the SIFT algorithm on them. This gives me a set of descriptors for each image; the number of descriptors varies per image, but each descriptor has a fixed length of 128.
I then proceed to cluster them, since it is not possible to apply algorithms to a varying number of features. For this, I stack up all the descriptors of all images and run the k-means algorithm with the desired number of clusters. What I get are k cluster centres of length 128.
Here is where I am confused: now that I have these new descriptors, what do I do with them? I don't understand how I can plug them into a classifier if they represent all images. Should each image have its own separate features to be fed into a classifier?
I am sure I did not understand the concept, but can anybody please clarify what happens after I get a k*128 sized matrix? What is fed into, for example, an SVM classifier and how? How does this k-means result suffice to train a classifier?
Thanks!
EDIT: I might have confused keypoints and descriptors, sorry, I'm new to image processing!
You should look into the image classification/image retrieval approach known as "bag of visual words"; it is extremely relevant. A bag of visual words is a fixed-length feature vector v which summarises the occurrences of the features in an image. This makes use of what is called a codebook (also called a dictionary, from its historical use in text retrieval), which in your case is built from your k-means clustering. To make v for a given image, the simplest approach is to assign v[j] the proportion of SIFT descriptors that are closest to the jth cluster centroid. This means the length of v is K, so it is independent of the number of SIFT features that are detected in the image.
Concretely, suppose you've done k-means clustering with K = 100. Let's use ci to denote the ith cluster centre; for SIFT, this would be a vector of size 128. Now, for a given input image, you make the vector v, which is of size 100 and initialized with zeros. You then extract features from the image, along with their corresponding descriptors. Let's say there are N descriptors, and we will call them d0, d1, ..., d(N-1), where dj is the jth descriptor. For each dj you compute the vector distance between it and the cluster centres c0, c1, ..., c99. You then take the cluster index k with the lowest distance to dj and increment: v[k] += 1. Note that this process can be parallelised very well, particularly on GPUs. It can also be made faster by using what is known as approximate nearest neighbours, e.g. with the FLANN library.
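A condensed Python sketch of that pipeline, using scikit-learn's KMeans for the codebook (file names are placeholders, and K = 100 as in the example above):

import cv2
import numpy as np
from sklearn.cluster import KMeans

K = 100
sift = cv2.SIFT_create()

def sift_descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, des = sift.detectAndCompute(img, None)
    return des  # shape (N, 128), N varies per image

train_paths = ["img_0.png", "img_1.png", "img_2.png"]

# 1. Stack the descriptors of all training images and learn the codebook.
all_des = np.vstack([sift_descriptors(p) for p in train_paths])
codebook = KMeans(n_clusters=K, n_init=10).fit(all_des)

# 2. Build one fixed-length bag-of-words vector v per image.
def bow_vector(path):
    des = sift_descriptors(path)
    words = codebook.predict(des)                    # nearest centroid per descriptor
    v = np.bincount(words, minlength=K).astype(float)
    return v / v.sum()                               # proportion, independent of N

X = np.array([bow_vector(p) for p in train_paths])   # one row per image, fed to a classifier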

Regarding the number of features extracted from an image for training

I am building software to classify cells from images taken by a microscope.
I have a dataset of images of cells to use as a training dataset, and I have extracted keypoints from each image using ORB. Here is my problem: some images produce a lot of keypoints and some only a small number, so the descriptor matrices produced have different lengths. When I try to build a training matrix from them, I have to "normalize" the number of keypoints chosen from each image so that the length of the descriptor vectors will be the same.
How many keypoints should I pick, and which ones? How do I pick the "best" keypoints? (This question also arises when I want to perform a prediction on an object I want to classify.) Are there known approaches to this problem?
Regards.
You could use a bag-of-words approach to classify your images. You first have to collect all keypoint descriptors and cluster them into a certain number of groups. Each group (cluster) is a "word", and each descriptor is assigned to a word. For each image, you can then build a histogram by counting the occurrences of the words, and normalize it to remove the effect of the varying number of keypoints in different images.
Using spatial pyramid matching could be another solution.
The simplest approach, as described by Ajay, is to cluster keypoints into N clusters and then define N binary features, such that for a given sample, feature i equals 1 if the sample shows a keypoint in cluster i, and 0 otherwise.
Another approach is to use a kernel classifier, like Support Vector Machines (SVM), and use a kernel that accepts variable-length vectors (e.g. Fisher kernel).
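For completeness, a minimal sketch (with hypothetical data shapes) of feeding such fixed-length bag-of-words histograms into an SVM with scikit-learn:

import numpy as np
from sklearn.svm import SVC

# One normalized bag-of-words histogram per training image (here random
# placeholders: 20 images, a 100-word vocabulary, 2 cell classes).
X_train = np.random.rand(20, 100)
y_train = np.random.randint(0, 2, size=20)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

X_test = np.random.rand(5, 100)
print(clf.predict(X_test))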

Is it possible to compute a descriptor at a given position which may or may not qualify as a keypoint?

I want to analyze the similarity of two images.
A conventional way to do this is:
Detect features (keypoints) in both images.
Compute descriptors for every keypoint.
Compute a match using these two sets of descriptors.
However, in my case I already have the matched point sets from the two images.
So I think I can proceed directly to the second step, descriptor computation.
Is this reasonable and possible?
Yes, it's possible to compute descriptors for any point, whether it is a keypoint or not. After all, a descriptor is just a representation of the intensities in a patch. For computing similarity, you could use bag of words.
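A small Python sketch of that idea (the coordinates and patch size are hypothetical, and ORB stands in for whatever descriptor you prefer):

import cv2

img = cv2.imread("image1.png", cv2.IMREAD_GRAYSCALE)

# The matched points you already have (hypothetical coordinates).
points = [(120.0, 85.0), (300.5, 42.0), (57.0, 210.0)]

# Wrap each point in a cv2.KeyPoint; the size (patch diameter in pixels)
# must be chosen by hand because no detector estimated it for you.
keypoints = [cv2.KeyPoint(x, y, 31) for (x, y) in points]

orb = cv2.ORB_create()
keypoints, descriptors = orb.compute(img, keypoints)
print(descriptors.shape)  # points too close to the border may be dropped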
A better way to compute image similarity is to compare mid-level deep features, which can easily be computed using Caffe (caffe.berkeleyvision.org).

Searching an Image Database Using SIFT

Several questions have been asked about the SIFT algorithm, but they all seem focussed on a simple comparison between two images. Instead of determining how similar two images are, would it be practical to use SIFT to find the closest matching image out of a collection of thousands of images? In other words, is SIFT scalable?
For example, would it be practical to use SIFT to generate keypoints for a batch of images, store the keypoints in a database, and then find the ones that have the shortest Euclidean distance to the keypoints generated for a "query" image?
When calculating the Euclidean distance, would you ignore the x, y, scale, and orientation parts of the keypoints, and only look at the descriptor?
There are several approaches.
One popular approach is the so-called bag-of-words representation, which does matching based solely on how many descriptors match, thus ignoring the location part consisting of (x, y, scale, orientation) and looking only at the descriptor.
Efficient querying of a large database may use approximate methods like locality-sensitive hashing.
Other methods may involve vocabulary trees or other data structures.
For an efficient method that also takes location information into account, check out pyramid match kernels.
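As a rough sketch of the descriptor-only approach (ignoring x, y, scale and orientation), here is a small Python example that scores database images against a query with SIFT plus FLANN approximate nearest neighbours; all file names are placeholders:

import cv2
import numpy as np

sift = cv2.SIFT_create()

def descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, des = sift.detectAndCompute(img, None)
    return des

db_paths = ["db_0.png", "db_1.png", "db_2.png"]
db_des = [descriptors(p) for p in db_paths]
query_des = descriptors("query.png")

# Approximate nearest-neighbour matching with FLANN KD-trees; only the
# 128-D descriptors are compared.
flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})

scores = []
for des in db_des:
    matches = flann.knnMatch(query_des, des, k=2)
    # Lowe's ratio test keeps only distinctive matches.
    good = [p[0] for p in matches if len(p) == 2 and p[0].distance < 0.7 * p[1].distance]
    scores.append(len(good))

print("best match:", db_paths[int(np.argmax(scores))])

Note that this scans every database image linearly; for thousands of images, that scan is exactly what the bag-of-words, LSH or vocabulary-tree indexing mentioned above is meant to replace.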
