I was studying the Viola-Jones paper to better understand their object detection algorithm and to produce a working implementation. In the last paragraph of the section on features, the authors discuss the base resolution of the detector, which is 24x24, and say that the exhaustive set of rectangle features is quite large, over 180,000: "Note that unlike the Haar basis, the set of rectangle features is overcomplete." Does this mean that every single rectangle feature is 24 by 24, or does it simply mean that we divide a given image into 24x24 blocks? Is 180,000 the result of counting the several types of Haar-like features for every 24x24 block? I also couldn't understand the last part, which states that the set of rectangle features is overcomplete. What does being overcomplete mean when we are talking about rectangle features? Thanks.
Every 24x24 rectangle feature gives you only one number, as stated earlier in the same paragraph: "The value of a two-rectangle feature is the difference between the sum of the pixels within two rectangular regions" and "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature computes the difference between diagonal pairs of rectangles."
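For concreteness, here is a minimal sketch (not from the paper, just an illustration) of how a single two-rectangle feature produces one number, using an integral image for constant-time rectangle sums. The window contents and the feature geometry below are arbitrary illustrative choices:

```python
# Sketch: one two-rectangle feature evaluated on a 24x24 detection window.
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img[:y, :x]; padded so rectangle sums need no edge cases
    return np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def rect_sum(ii, x, y, w, h):
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

window = np.random.randint(0, 256, (24, 24))   # a 24x24 detection window
ii = integral_image(window)

# Horizontal two-rectangle feature: "white" half minus "black" half.
white = rect_sum(ii, x=4, y=6, w=6, h=8)
black = rect_sum(ii, x=10, y=6, w=6, h=8)
print(white - black)                            # the single feature value
```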
You can find an explanation of the number 180,000 in:
Viola-Jones' face detection claims 180k features
An overcomplete set means that some features are a linear combination of other features. In the case of 24x24 rectangle features, we can build a linear basis for this space by taking all the rectangles with value 1 in one of their pixels and zero in all the rest. If we count how many options this configuration has, we get 24*24 = 576, which is much less than 180,000. This means that in the set of 180,000 there are some rectangle features that can be obtained as combinations of other features in the set.
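To see where a number of that order comes from, here is a rough counting sketch. It assumes the five classic feature shapes (two two-rectangle, two three-rectangle, and one four-rectangle layout); the exact total depends on which shapes you include:

```python
# Sketch: count Haar-like rectangle features in a 24x24 detection window.
def count_features(window=24, shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    total = 0
    for sx, sy in shapes:                        # grid of sub-rectangles
        for w in range(sx, window + 1, sx):      # feature widths (multiples of sx)
            for h in range(sy, window + 1, sy):  # feature heights (multiples of sy)
                # number of positions the w x h feature fits in the window
                total += (window - w + 1) * (window - h + 1)
    return total

print(count_features())   # 162,336 for these five shapes
```

These five shapes give 162,336 features; the paper's figure of "over 180,000" presumably comes from a slightly different enumeration of shapes, but either way the set is far larger than the 576-dimensional pixel basis.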
I have a question that came up while programming the K-means algorithm in Matlab. Why is the K-means algorithm not suitable for clustering an elongated data set?
In short: draw some thick lines on a piece of paper. Can you really represent each one with a single point? How would single points give any information about orientation?
K-means assigns each data point to its nearest centroid. That is to say, for each centroid c, all points whose distance to c is smaller than their distance to every other centroid will be assigned to c. Since a (hyper)ball is exactly the set of all points within some distance of a center, it is easy to see why the resulting clusters tend to be spherical. (To be exact, k-means effectively creates a Voronoi diagram in the vector space.)
Elongated clusters, however, don't necessarily satisfy the requirement that all of their points are closer to their "center of mass" than to some other cluster's center.
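A small sketch of this failure mode, assuming NumPy and scikit-learn are available; the data set is synthetic and purely illustrative:

```python
# Sketch: two elongated (stretched Gaussian) clusters that k-means splits badly.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two long, thin clusters lying side by side along the x-axis.
a = rng.normal(size=(300, 2)) * [10.0, 0.5] + [0.0, 0.0]
b = rng.normal(size=(300, 2)) * [10.0, 0.5] + [0.0, 3.0]
X = np.vstack([a, b])
labels_true = np.array([0] * 300 + [1] * 300)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement up to label permutation: far below 100%, because the Voronoi
# boundary cuts each elongated cluster roughly in half along its long axis.
acc = max(np.mean(labels == labels_true), np.mean(labels != labels_true))
print(f"agreement with true clusters: {acc:.2f}")
```

Because the within-cluster variance along the long axis dominates, k-means prefers to cut each elongated cluster in half rather than recover the two true clusters.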
It is also difficult to choose good initial cluster centers for an elongated data set, and the initialization has a strong effect on the result: you may get different results when you choose different initial points.
In this case you will get only one result, whichever 3 initial points you choose:
But with an elongated data set it is different.
I am currently teaching myself dense optical flow. To understand it, I conducted an experiment. I produced one image in Matlab: a box with a given gray value placed on a uniform background, and in a second image the box is translated by two pixels in both the x and y directions. The two images are fed into an implementation of the algorithm called TV-L1. The generated motion vectors outside the box are not zero. Is the reason that the gradient outside the box is zero? Are the values there filled in from the values at pixels with large gradients?
In Horn and Schunck's paper, it reads
In parts of the image where the brightness gradient is zero, the velocity estimates will simply be averages of the neighboring velocity estimates. There is no local information to constrain the apparent velocity of motion of the brightness pattern in these areas.
The progress of this filling-in phenomena is similar to the propagation effects in the solution of the heat equation for a uniform flat plate, where the time rate of change of temperature is proportional to the Laplacian.
Is it not possible to obtain correct motion vectors for pixels with small gradients? Or is my experiment simply not realistic, so that this doesn't happen in practical applications?
Yes, in so-called homogeneous image regions with very small gradients, there is no information from which motion can be derived. That's why the motion of your rectangle is propagated beyond its border. If you give your background a texture, this effect will be less dominant. I know this problem from estimating the ego-motion of a car: the street causes a lot of trouble because of its homogeneity.
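If you want to reproduce the effect without Matlab, here is a rough sketch using OpenCV's Farneback flow (chosen only because it ships with core OpenCV; the question used TV-L1). The image size, gray values, and box position are illustrative assumptions:

```python
# Sketch: a uniform background with a box shifted by (2, 2), then dense flow.
import cv2
import numpy as np

h, w = 128, 128
img1 = np.full((h, w), 64, np.uint8)       # uniform background
img2 = img1.copy()
img1[40:80, 40:80] = 200                   # gray box
img2[42:82, 42:82] = 200                   # same box shifted by (2, 2)

flow = cv2.calcOpticalFlowFarneback(img1, img2, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

print("flow near the box corner:", flow[41, 41])  # expected roughly (2, 2)
print("flow far from the box   :", flow[10, 10])  # may be non-zero: filled in
```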
Two pioneers in this field, Lucas & Kanade (LK) and Horn & Schunck (HS), developed methods for computing optical flow (OF). Both rely on the brightness constancy assumption: the pixel values at a feature's location do not change between two consecutive frames. This constraint may be expressed as I(x+dx, y+dy, t+dt) = I(x, y, t). Using a first-order Taylor series expansion, I(x+dx, y+dy, t+dt) ≈ I(x, y, t) + (∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt. Letting u = dx/dt and v = dy/dt and combining these two equations, we get the OF constraint equation: (∂I/∂x)u + (∂I/∂y)v + ∂I/∂t = 0. This single equation has two unknowns (u, v), so it has more than one solution, and the different techniques diverge here. The LK equations are derived assuming that pixels in a neighborhood of each tracked feature move with the same velocity as the feature. In OpenCV, a pyramidal (coarse-to-fine) implementation of LK is used to catch large motions while keeping a small window size (to preserve the "same local velocity" assumption).
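As an illustration of that last point, here is a minimal sketch of pyramidal LK in OpenCV; the frames, the tracked point, and the parameter values are made up for the example:

```python
# Sketch: track one corner of a moving patch with pyramidal Lucas-Kanade.
import cv2
import numpy as np

prev_img = np.zeros((128, 128), np.uint8)
next_img = np.zeros((128, 128), np.uint8)
prev_img[40:60, 40:60] = 255
next_img[50:70, 50:70] = 255                      # the patch moved by (10, 10)

prev_pts = np.array([[[40.0, 40.0]]], np.float32)  # patch corner to track (x, y)

next_pts, status, err = cv2.calcOpticalFlowPyrLK(
    prev_img, next_img, prev_pts, None,
    winSize=(15, 15),   # small local window ("same local velocity")
    maxLevel=3)         # coarse-to-fine pyramid handles the large motion

print(next_pts[0, 0], status[0])   # roughly (50, 50) if tracking succeeded
```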
I would like to create an application which can learn to classify a sequence of points drawn by a user, e.g. something like handwriting recognition. If the data point consists of a number of (x,y) pairs (like the pixels corresponding to a gesture instance), what are the best features to compute about the instance which would make for a good multi-class classifier (e.g. SVM, NN, etc)? Particularly if there are limited training examples provided.
If I were you, I would find the data points that correspond with corners, end points and intersections, use those as features and discard the intermediate points. You could include the angle or some other descriptor of these interest points as well.
For detecting interest points you could use a Harris detector; you could then use the gradient value at that point as a simple descriptor. Alternatively you could go with a fancier method like SIFT.
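A rough sketch of that idea, assuming OpenCV is available: rasterize the (x, y) sequence into a small image and keep only corner-like points as features, using OpenCV's goodFeaturesToTrack (a Harris-style corner detector). The gesture coordinates and parameters below are invented for illustration:

```python
# Sketch: reduce a drawn gesture to its corner-like interest points.
import cv2
import numpy as np

gesture = [(10, 10), (10, 50), (30, 30), (50, 50), (50, 10)]  # (x, y) strokes

canvas = np.zeros((64, 64), np.uint8)
pts = np.array(gesture, np.int32).reshape(-1, 1, 2)
cv2.polylines(canvas, [pts], isClosed=False, color=255, thickness=2)

# Strongest corners = candidate end points / turning points of the gesture.
corners = cv2.goodFeaturesToTrack(canvas, maxCorners=10,
                                  qualityLevel=0.1, minDistance=5)
feature_vector = corners.reshape(-1, 2)   # e.g. feed (padded/sorted) to an SVM
print(feature_vector)
```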
You could use the descriptor of every pixel in your downsampled image and then classify with an SVM. The disadvantage of that is that there would be a large number of uninteresting data points in the feature vector.
An alternative would be to not approach it as a classification problem but as a template matching problem (fairly common in computer vision). In this case a gesture can be specified as an arbitrary number of interest points, completely leaving out the non-interesting data. A certain threshold percentage of an instance's points has to match a template for a positive identification. For example, when matching the corner points of an instance of 'R' against the template for 'X', the bottom-right point should match, being an end point in the same position and orientation, but the others are too dissimilar, giving a fairly low score, and the identification R=X will be rejected.
I'm using OpenCV 2.3 for keypoint detection and matching. But I am a bit confused by the size and response parameters given by the detection algorithm. What exactly do they mean?
Based on the OpenCV manual, I can't figure it out:
float size: diameter of the meaningful keypoint neighborhood
float response: the response by which the most strong keypoints have been selected. Can be used for further sorting or subsampling
I thought the best points to track would be the ones with the highest response, but it seems that is not the case. So how could I subsample the set of keypoints returned by the SURF detector to keep only the best ones in terms of trackability?
Size and response
SURF is a blob detector; in short, the size of a feature is the size of the blob. To be more precise, the size returned by OpenCV is half the length of the approximated Hessian operator. The size is also known as scale; this is due to the way blob detectors work, i.e., being functionally equal to first blurring the image with a Gaussian filter at several scales, then downsampling the images, and finally detecting blobs with a fixed size. See the image below showing the size of the SURF features. The size of each feature is the radius of the drawn circle. The lines going out from the center of the features to the circumference show the angles or orientations. In this image, the response strength of the blob detection filter is color coded. You can see that the majority of the detected features have a weak response. (See the full size image here.)
This histogram shows the distribution of the response strengths of the features in the above image:
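You can compute such a distribution yourself with a few lines; in this sketch ORB is used only because SURF sits in the non-free/contrib module, and the image path is a placeholder:

```python
# Sketch: inspect the size and response attributes of detected keypoints.
import cv2
import numpy as np

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
kps = cv2.ORB_create(nfeatures=2000).detect(img, None)

sizes = np.array([kp.size for kp in kps])          # blob diameter per keypoint
responses = np.array([kp.response for kp in kps])  # detection filter strength

counts, edges = np.histogram(responses, bins=20)
print("median keypoint size (diameter):", np.median(sizes))
print("response histogram (weak to strong bins):", counts)
```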
What features to track?
The most robust feature tracker tracks all the detected features: the more features, the more robustness. But it is impractical to track a large number of features, as we usually want to limit the computation time. The number of features to track often has to be tuned empirically for each application. A common approach is to divide the image into regular sub-regions and keep only the n strongest features in each one, with n chosen such that in total about 500~1000 features are kept per frame.
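Here is a sketch of that grid-based subsampling, assuming an ORB detector (since SURF is in the non-free module) and a hypothetical input image; the grid size and per-cell count are parameters you would tune:

```python
# Sketch: keep only the n strongest keypoints (by response) in each grid cell.
import cv2

def strongest_per_cell(keypoints, img_shape, grid=(4, 4), n=50):
    cell_h = img_shape[0] / grid[0]
    cell_w = img_shape[1] / grid[1]
    cells = {}
    for kp in keypoints:
        key = (int(kp.pt[1] // cell_h), int(kp.pt[0] // cell_w))
        cells.setdefault(key, []).append(kp)
    kept = []
    for cell_kps in cells.values():
        cell_kps.sort(key=lambda kp: kp.response, reverse=True)
        kept.extend(cell_kps[:n])
    return kept

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
kps = cv2.ORB_create(nfeatures=5000).detect(img, None)
kps = strongest_per_cell(kps, img.shape, grid=(4, 4), n=50)   # at most 800 kept
```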
References
Reading the journal paper describing SURF will definitely give you a good idea of how it works. Just try not to get stuck in the details, especially if your background isn't in machine/computer vision or image processing. The SURF detector may seem extremely novel at first glance, but the whole idea is estimating the Hessian operator (a well-established filter) using integral images (which had been used by other methods long before SURF). If you want to understand SURF very well and you're not familiar with image processing, you need to go back and read some introductory material. Recently I came across a new and free book whose chapter 13 has a good and brief introduction to feature detection. Not everything said there is technically correct, but it's a good starting point. Here you can find another good description of SURF, with several images showing how each step works. On that page you see this image:
You can see the white and black blobs; these are the blobs that SURF detects at several scales, estimating their sizes (the radius in the OpenCV code).
"size" is the size of the area covered by the descriptor in the original image (it is obtained by downsampling the original image in the scale space, hence it varies from key point to key point based on their scale).
"reponse" is indeed an indicator of "how good" (roughly speaking, in terms of corner-ness) a point is.
Good points are stable for static scene retrieval (this is the main purpose of SIFT/SURF descriptors). In the case of tracking, you can have good points appearing because the tracked object is on a well-formed background, or half in the shadow... and then disappearing because this condition has changed (change of light, occlusion...). So for tracking tasks there is no guarantee that a good point will always be there.
My lecturer has slides on edge histograms for image retrieval, in which he states that one must first divide the image into 4x4 blocks and then check for edges at the horizontal, vertical, +45°, and -45° orientations. He then states that this is represented in a 14x1 histogram. I have no idea how he decided that a 14x1 histogram must be created. Does anyone know how he came up with this value, or how to create an edge histogram?
Thanks.
The thing you are referring to is called the Histogram of Oriented Gradients (HoG). However, the math doesn't work out for your example. Normally you choose spatial binning parameters (the 4x4 grid of blocks). For each block, you compute the gradient magnitude at some number of different directions (in your case, 4 directions). So in each block you have N_{directions} measurements. Multiply this by the number of blocks (16 for you), and you see that you have 16*N_{directions} total measurements.
To form the histogram, you simply concatenate these measurements into one long vector. Any way to do the concatenation is fine as long as you keep track of the way you map the bin/direction combo into a slot in the 1-D histogram. This long histogram of concatenations is then most often used for machine learning tasks, like training a classifier to recognize some aspect of images based upon the way their gradients are oriented.
But in your case, the professor must be doing something special, because if you have 16 different image blocks (a 4x4 grid of image blocks), then you'd need to compute less than 1 measurement per block to end up with a total of 14 measurements in the overall histogram.
Alternatively, the professor might mean that you take the range of angles in between [-45,+45] and you divide that into 14 different values: -45, -45 + 90/14, -45 + 2*90/14, ... and so on.
If that is what the professor means, then in that case you get 14 orientation bins within a single block. Once everything is concatenated, you'd have one very long 14*16 = 224-component vector describing the whole image overall.
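For reference, here is a sketch of the generic block-wise edge/orientation histogram described above: a 4x4 grid of blocks, a small number of orientation bins per block, concatenated into one vector. It follows the standard HoG recipe rather than your lecturer's exact 14-bin variant, and the input image path is a placeholder:

```python
# Sketch: block-wise, magnitude-weighted orientation histogram of an image.
import numpy as np
import cv2

def edge_histogram(gray, grid=(4, 4), n_bins=4):
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180       # orientation in [0, 180)

    h, w = gray.shape
    bh, bw = h // grid[0], w // grid[1]
    hist = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            sl = np.s_[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            # histogram over n_bins orientation bins, weighted by edge strength
            block_hist, _ = np.histogram(ang[sl], bins=n_bins,
                                         range=(0, 180), weights=mag[sl])
            hist.append(block_hist)
    return np.concatenate(hist)                      # 16 * n_bins values

gray = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
print(edge_histogram(gray).shape)                     # (64,) with 4 bins
```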
Incidentally, I have done a lot of testing with Python implementations of the Histogram of Oriented Gradients, so you can see some of the work linked here or here. There is also some example code at that site, though a better-supported version of HoG appears in scikits.image.