I'm trying to create the dataset of SIFT descriptors from the Oxford building dataset.
It's around 5k images, resized so that the largest dimension (width or height) is 1024 px. Using the default VLFeat implementation, it generates on average 10k keypoints per image.
Now, let's do some math to compute the memory needed to store all the descriptors:
10^3 (avg # keypoints per image) * 5 * 10^3 (images) * 128 (descriptor dimension) * 32 (bits per float) / 8 (bits per byte) / 10^6 (GBs) = 2560 GBs
Wow, that's a lot of memory! :D So my question is pretty simple: did I write something wrong in the previous line, or do we really need all this memory? It's weird, because this is a really famous algorithm for object recognition. What's wrong?
In this answer, I will first say something about how you can reduce the number of features in a single image to a more reasonable amount than 10k. Next I try to explain how you can get around saving all extracted features. Oh, and to answer your question: I think you made a mistake when converting bytes to gigabytes: 1 gigabyte is 1024^3 bytes, so I calculate ~2.4 GB.
1. Reducing the number of features
You probably won't need 10k SIFT features per image for a reliable matching. Many of the detected features will correspond to very small gradients, and most likely be noise. The documentation of the vl_sift function gives you two ways to control the number of descriptor points:
using a peak threshold PeakThresh. This is a threshold applied to the Difference of Gaussians (DoG) image, so only points with a large peak in the DoG, i.e. large and distinct gradients, are considered as features.
using the edge threshold edgethresh. Since we only want keypoints and not whole edges, edge-like responses are removed. This process can be controlled with the edgethresh parameter, where a higher value gives you more features.
This helps you reduce the number of features to a reasonable amount, i.e. an amount you can handle (how much that is depends on your infrastructure, your needs, and your patience). This works for small-scale retrieval applications. For large-scale image retrieval, however, you might have thousands or millions of images. Even if you only had 10 descriptors per image, you would still have difficulties handling this.
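The above is in terms of VLFeat's vl_sift (MATLAB); as a rough, hedged analogue, here is a minimal Python sketch using OpenCV's SIFT, whose contrastThreshold and edgeThreshold parameters play roles similar to PeakThresh and edgethresh. The threshold values and the file name are only placeholders to experiment with.

```python
# Rough analogue of tuning vl_sift's PeakThresh / edgethresh with OpenCV's SIFT.
# The exact values are starting points, not recommendations.
import cv2

img = cv2.imread("oxford_0001.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path to one dataset image

# Higher contrastThreshold discards weak (low-peak) DoG extrema;
# lower edgeThreshold discards more edge-like responses.
sift = cv2.SIFT_create(contrastThreshold=0.08, edgeThreshold=5)

keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), "keypoints kept")
```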
2. Clustering into visual words
To solve this problem, you run k-means clustering on the extracted descriptors, which gives you a number k of so-called "visual words" (following the terminology of text retrieval). Next, you build an inverted index, which is exactly the same as a book index: you record which word is mentioned on which page of the book - or here: which visual word appears in which image.
For a new image, all you have to do is find the nearest visual word for each descriptor, and use the inverse index to see which database image contains the same visual words as your query image.
That way, all you have to save is the centroids of your visual words. The number of visual words can be anything, though I'd say something in the range between 100 and 10k is often reasonable.
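A minimal sketch of this pipeline, assuming the per-image SIFT descriptors are already available as NumPy arrays; it uses scikit-learn's MiniBatchKMeans, and the value of k and the variable names are purely illustrative.

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import MiniBatchKMeans

def build_codebook(descriptor_sample, k=1000):
    """Cluster a sample of descriptors into k visual words (the centroids)."""
    return MiniBatchKMeans(n_clusters=k, random_state=0).fit(descriptor_sample)

def build_inverted_index(kmeans, per_image_descriptors):
    """Record, for each visual word, which images it appears in."""
    index = defaultdict(set)
    for image_id, desc in per_image_descriptors.items():
        for word in kmeans.predict(desc):
            index[word].add(image_id)
    return index

def query(kmeans, index, query_descriptors):
    """Vote for database images that share visual words with the query."""
    votes = defaultdict(int)
    for word in set(kmeans.predict(query_descriptors)):
        for image_id in index[word]:
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])  # best match first
```

Note that only the k centroids (plus the inverted index) need to be kept, not the raw descriptors.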
Related
I'm implementing a cache for virtual reality applications: given an input image query, return the result associated with the most visually similar cached image (i.e. a previously processed query) if the distance between the query representation and the cached image representation is lower than a certain threshold. Our cache is relatively small and contains 10k image representations.
We use VLAD codes [1] as image representation since they are very compact and incredibly fast to compute (around 1 ms).
However, it has been shown in [2] that the distance between the query code and the images in the dataset (the cache in this case) varies a lot from query to query, so it's not trivial to find an absolute threshold. The same work proposes a method for object detection applications, which is not relevant in this context (we return just the most similar image, not all and only the images containing the query subject).
[3] offers a very precise method, but at the same time it's very expensive and returns short lists. It's based on spatial feature matching re-ranking, and if you want to know more details the quoted section is at the end of this question. I'm not an expert in computer vision, but this step sounds to me a lot like using a Feature Matcher on the short-list of the top-k elements according to the image representation and re-rank them based on the number of features matched. My first question is: is that correct?
In our case this approach is not a problem, since most of the time the top-10 most similar VLAD codes contain the query subject, so we would need to do this spatial matching step on only 10 images.
However, at this point I have a second question: if we had the problem of deciding an absolute threshold for image representations (such as VLAD codes), does this problem still persist with this approach? In the first case, the threshold was "the L2 distance between the query VLAD code and the closest VLAD code"; here, instead, the threshold value would represent "the number of features matched between the query image and the closest image according to VLAD codes".
Of course, my second question only makes sense if the answer to the first one is positive.
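To make my reading of the spatial matching step concrete (question 1), here is a rough sketch of what I imagine it looks like. Note that it uses OpenCV's brute-force matcher plus RANSAC homography fitting rather than the Hough/affine scheme of [3], so it only approximates that method, and shortlist, qkp and qdesc are hypothetical names.

```python
import cv2
import numpy as np

def inlier_count(query_kp, query_desc, img_kp, img_desc, ratio=0.75):
    """Match descriptors, fit a geometric model, and return the inlier count."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(query_desc, img_desc, k=2)
    matches = [p[0] for p in pairs
               if len(p) == 2 and p[0].distance < ratio * p[1].distance]  # Lowe's ratio test
    if len(matches) < 4:
        return 0
    src = np.float32([query_kp[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([img_kp[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return 0 if mask is None else int(mask.sum())

# Re-rank the short-list (e.g. the top-10 images by VLAD distance) by inlier count:
# reranked = sorted(shortlist, key=lambda im: inlier_count(qkp, qdesc, im.kp, im.desc), reverse=True)
```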
The approach of [3]:
Geometrical Re-ranking verifies the global geometrical consistency between matches (Lowe 2004; Philbin et al. 2007) for a short-list of database images returned by the image search system. Here we implement the approach of Lowe (2004) and apply it to a short-list of 200 images. We first obtain a set of matches, i.e., each descriptor of the query image is matched to the 10 closest ones in all the short-list images. We then estimate an affine 2D transformation in two steps. First, a Hough scheme estimates a transformation with 4 degrees of freedom. Each pair of matching regions generates a set of parameters that "vote" in a 4D histogram. In a second step, the sets of matches from the largest bins are used to estimate a finer 2D affine transform. The images for which the geometrical estimation succeeds are returned in first positions and ranked with a score based on the number of inliers. The images for which the estimation failed are appended to the geometrically matched ones, with their order unchanged.
PREMISE:
I'm really new to Computer Vision/Image Processing and Machine Learning (luckily, I'm more of an expert in Information Retrieval), so please be kind with this filthy peasant! :D
MY APPLICATION:
We have a mobile application where the user takes a photo (the query) and the system returns the most similar picture that was previously taken by some other user (the dataset element). Time performance is crucial, followed by precision and finally by memory usage.
MY APPROACH:
First of all, it's quite obvious that this is a 1-Nearest Neighbor (1-NN) problem. LSH is a popular, fast and relatively precise solution for it. In particular, my LSH implementation uses Kernelized Locality-Sensitive Hashing to translate a d-dimensional vector into an s-dimensional binary vector (where s << d) with good precision, and then Fast Exact Search in Hamming Space with Multi-Index Hashing to quickly find the exact nearest neighbor among all the vectors in the dataset (mapped to Hamming space).
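Just to illustrate the overall shape of this step (this is a deliberate simplification: plain random-hyperplane LSH instead of Kernelized LSH, and a brute-force Hamming scan instead of Multi-Index Hashing; the dimensions are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d, s = 4096, 128                       # illustrative original and binary code dimensions
hyperplanes = rng.standard_normal((s, d))

def to_binary_code(x):
    """Project onto s random hyperplanes and keep the signs as a bit vector."""
    return (hyperplanes @ x > 0).astype(np.uint8)

def nearest_in_hamming(query_code, dataset_codes):
    """Return the index of the dataset code with the smallest Hamming distance."""
    distances = np.count_nonzero(dataset_codes != query_code, axis=1)
    return int(np.argmin(distances)), int(distances.min())
```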
In addition, I'm going to use SIFT, since I want a robust keypoint detector & descriptor for my application.
WHAT IS MISSING IN THIS PROCESS?
Well, it seems that I have already decided everything, right? Actually NO: in my linked question I face the problem of how to represent the set of descriptor vectors of a single image as a single vector. Why do I need it? Because a query/dataset element in LSH is a vector, not a matrix (while a SIFT keypoint descriptor set is a matrix). As someone suggested in the comments, the commonest (and most efficient) solution is the Bag of Features (BoF) model, which I'm not yet confident with.
So, I read this article, but I still have some questions (see QUESTIONS below)!
QUESTIONS:
First and most important question: do you think that this is a reasonable approach?
1. Is the k-means used in the BoF approach the best choice for such an application? What are the alternative clustering algorithms?
2. Is the dimension of the codeword vector obtained by BoF equal to the number of clusters (so the k parameter in the k-means approach)?
3. If 2. is correct, does a bigger k give a more precise BoF vector?
4. Is there any "dynamic" k-means? Since the query image must be added to the dataset after the computation is done (remember: the dataset is formed by the images of all submitted queries), the clusters can change over time.
5. Given a query image, is the process to obtain the codeword vector the same as for a dataset image, i.e. we assign each descriptor to a cluster and the i-th dimension of the resulting vector is equal to the number of descriptors assigned to the i-th cluster?
It looks like you are building a codebook from a set of keypoint features generated by SIFT.
1. You can try a "mixture of Gaussians" model. K-means assumes that each dimension of a keypoint is independent, while a "mixture of Gaussians" can model the correlation between the dimensions of the keypoint feature (see the sketch after this list).
2. I can't answer this question. But I remember that the SIFT keypoint, by default, has 128 dimensions. You probably want a smaller number of clusters, like 50 clusters.
3. N/A
4. You can try the Infinite Gaussian Mixture Model, or look at this paper: "Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis and Michael Jordan!
5. Not sure if I understand this question.
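To make point 1 concrete, here is a hedged sketch using scikit-learn's GaussianMixture in place of k-means; the 50 components echo the figure in point 2, and describing each image by its averaged component posteriors (a soft BoF histogram) is one possible choice, not the only one.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_codebook(descriptor_sample, n_components=50):
    # full covariances let the model capture correlations between the 128 SIFT
    # dimensions, which plain k-means ignores (needs a reasonably large sample)
    return GaussianMixture(n_components=n_components,
                           covariance_type="full",
                           random_state=0).fit(descriptor_sample)

def soft_bof(gmm, image_descriptors):
    # predict_proba returns an (n_descriptors, n_components) responsibility matrix;
    # averaging it gives a fixed-length vector per image
    return gmm.predict_proba(image_descriptors).mean(axis=0)
```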
Hope this helps!
I have 100,000 points that I would like to cluster using the OPTICS algorithm in ELKI. I have an upper triangular distance matrix of about 5 billion entries for this point set. In the format that ELKI wants the matrix, it will take about 100 GB in memory. I am wondering whether ELKI can handle that sort of data load. Can anyone confirm that they have made this work before?
I frequently use ELKI with 100k points, up to 10 million.
However, for this to be fast you should use indexes.
For obvious reasons, any dense-matrix-based approach will scale at best O(n^2) and need O(n^2) memory, which is why I cannot process these data sets with R, Weka or scipy. They usually first try to compute the full distance matrix, and either fail halfway through, run out of memory halfway through, or fail with a negative allocation size (Weka does this when your data set overflows the 2^31 positive integers, i.e. at around 46k objects).
In the binary format, with float precision, the ELKI matrix format should be around 100000*99999/2*4 + 4 bytes, maybe plus another 4 bytes for size information. This is about 20 GB.
If you use the "easy to use" ascii format, then it will indeed be more. But if you use gzip compression it may end up being about the same size. It's common to have gzip compress such data to 10-20% of the raw size. In my experience gzip compressed ascii can be as small as binary encoded doubles.
The main benefit of the binary format is that it will actually reside on disk, and memory caching will be handled by your operating system.
Either way, I recommend not computing distance matrices at all in the first place.
Because if you decide to go from 100k to 1 million, the raw matrix would grow to 2 TB, and when you go to 10 million it will be 200 TB. If you want double precision, double that.
If you are using distance matrices, your method will be at best O(n^2) and thus will not scale. Avoiding computing all pairwise distances in the first place is an important speed factor.
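As a quick sanity check of the figures above (single precision, upper triangle only), a back-of-the-envelope computation:

```python
def triangle_bytes(n, bytes_per_entry=4):
    """Size of an upper-triangular distance matrix for n points."""
    return n * (n - 1) // 2 * bytes_per_entry

for n in (100_000, 1_000_000, 10_000_000):
    print(f"n={n:>10,}: {triangle_bytes(n) / 1e9:,.0f} GB")
# prints roughly 20 GB, 2,000 GB (2 TB) and 200,000 GB (200 TB), matching the figures above
```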
I use indexes for everything. For kNN or radius-bound approaches (for OPTICS, use the epsilon parameter to make indexes effective! Choose a low epsilon!) you can precompute these queries once, if you are going to need them repeatedly.
On a data set I frequently use, with 75k instances and 27 dimensions, the file storing the precomputed 101 nearest neighbors + ties, with double precision, is 81 MB (note: this can be seen as a sparse similarity matrix). By using an index to precompute this cache, it takes just a few minutes to compute; and then I can run most kNN-based algorithms such as LOF on this 75k dataset in 108 ms (plus 262 ms for loading the kNN cache and 2364 ms for parsing the raw input data, for a total runtime of about 3 seconds, dominated by parsing double values).
I'm writing a sliding window to extract features and feed it into CvSVM's predict function.
However, what I've stumbled upon is that the svm.predict function is relatively slow.
Basically the window slides thru the image with fixed stride length, on number of image scales.
Traversing the image and extracting features for each window takes around 1000 ms (1 sec).
Including the weak classifiers trained by AdaBoost brings this to around 1200 ms (1.2 secs).
However, when I pass the features (which have been marked as positive by the weak classifiers) to the svm.predict function, the overall time slows down to around 16000 ms (16 secs).
Trying to collect all 'positive' features first, before passing them to svm.predict using TBB's threads, resulted in 19000 ms (19 secs), probably due to the overhead needed to create the threads, etc.
My OpenCV build was compiled to include both TBB (threading) and OpenCL (GPU) functions.
Has anyone managed to speed up OpenCV's SVM.predict function ?
I've been stuck on this issue for quite some time, since it's frustrating to run this detection algorithm through my test data for statistics and threshold adjustment.
Thanks a lot for reading thru this !
(Answer posted to formalize my comments, above:)
The prediction algorithm for an SVM takes O(nSV * f) time, where nSV is the number of support vectors and f is the number of features. The number of support vectors can often be reduced by increasing the hyperparameter C, which penalizes margin violations more heavily and typically leaves fewer support vectors (possibly at a cost in predictive accuracy).
I'm not sure what features you are extracting, but from the size of your feature vector (3780) I would say you are extracting HOG. There is a very robust, optimized, and fast way of doing HOG "prediction" in the cv::HOGDescriptor class. All you need to do is the following (steps 4-6 are sketched in code after the list):
1. extract your HOGs for training
2. put them in the svmLight format
3. use the svmLight linear kernel to train a model
4. calculate the 3780 + 1 dimensional vector necessary for prediction
5. feed the vector to the setSVMDetector() method of a cv::HOGDescriptor object
6. use the detect() or detectMultiScale() methods for detection
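As a rough Python/OpenCV sketch of steps 4-6 (not the exact trainHOG code): the weights w and bias b stand in for the output of your own svmLight training, the detector's sign convention may need flipping depending on how your labels were encoded, and the black frame is only a stand-in for a real test image.

```python
import numpy as np
import cv2

# Stand-ins for the result of steps 1-4: replace with the primal weights and
# bias (rho) recovered from your trained svmLight model.
rng = np.random.default_rng(0)
w = rng.standard_normal(3780).astype(np.float32)
b = 0.0

hog = cv2.HOGDescriptor()                        # default 64x128 window -> 3780-dim descriptor
detector = np.append(w, -b).astype(np.float32)   # 3780 weights + bias term
hog.setSVMDetector(detector)

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in frame; use your test image instead
rects, scores = hog.detectMultiScale(frame, winStride=(8, 8), scale=1.05)
print(len(rects), "detections")
```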
The following document has very good information about how to achieve what you are trying to do: http://opencv.willowgarage.com/wiki/trainHOG although I must warn you that there is a small problem in the original program, but it teaches you how to approach this problem properly.
As Fred Foo has already mentioned, you have to reduce the number of support vectors. From my experience, 5-10% of the training base is enough to have a good level of prediction.
Other ways to make it faster:
reduce the size of the feature. 3780 is way too much. I'm not sure what this size of feature can describe in your case but from my experience, for example, a description of an image like the automobile logo can effectively be packed into size 150-200:
PCA can be used to reduce the size of the feature as well as reduce its "noise". There are examples of how it can be used with SVM;
if that doesn't help, try other principles of image description, for example LBP and/or LBP histograms
LDA (alone or with SVM) can also be used.
Try a linear SVM first. It is much faster, and your feature size of 3780 dimensions gives more than enough "space" for good separation if your sets are linearly separable in principle. If that is not good enough, try an RBF kernel with a fairly standard setup like C = 1 and gamma = 0.1, and only after that POLY, the slowest one.
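As a sketch of combining the PCA suggestion above with the linear-SVM-first advice, here is a scikit-learn pipeline rather than CvSVM; the data is a random stand-in with the 3780-dimensional feature size from the question, so treat it only as a template.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Random stand-in for your positive/negative window features (3780-dim HOG).
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3780))
y = rng.integers(0, 2, 1000)

clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=150),   # in the 150-200 range suggested above
    LinearSVC(C=1.0),
)
clf.fit(X, y)
print(clf.predict(X[:5]))    # prediction on the reduced features is fast
```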
What does dimensionality reduction mean exactly?
I searched for its meaning and only found that it means the transformation of raw data into a more useful form. So what is the benefit of having data in a more useful form? I mean, how can I use it in practice (in an application)?
Dimensionality Reduction is about converting data of very high dimensionality into data of much lower dimensionality such that each of the lower dimensions conveys much more information.
This is typically done while solving machine learning problems to get better features for a classification or regression task.
Here's a contrived example: suppose you have a list of 100 movies and 1000 people, and for each person you know whether they like or dislike each of the 100 movies. So for each instance (which in this case means each person) you have a binary vector of length 100 [position i is 0 if that person dislikes the i'th movie, 1 otherwise].
You can perform your machine learning task on these vectors directly, but instead you could decide upon 5 genres of movies and, using the data you already have, figure out whether the person likes or dislikes each entire genre, and in this way reduce your data from a vector of size 100 to a vector of size 5 [position i is 1 if the person likes genre i].
The vector of length 5 can be thought of as a good representative of the vector of length 100 because most people might be liking movies only in their preferred genres.
However, it's not going to be an exact representation, because there might be cases where a person hates all movies of a genre except one.
The point is that the reduced vector conveys most of the information in the larger one while consuming a lot less space and being faster to compute with.
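A tiny numerical version of this movie example (the like/dislike data and the genre assignment are random, just to show the 100-to-5 reduction):

```python
import numpy as np

rng = np.random.default_rng(0)
likes = rng.integers(0, 2, size=(1000, 100))    # 1 = person likes movie i, 0 = dislikes
genre_of_movie = rng.integers(0, 5, size=100)   # which of 5 genres each movie belongs to

# For each person, the fraction of movies they like within each genre.
reduced = np.column_stack([likes[:, genre_of_movie == g].mean(axis=1) for g in range(5)])
print(likes.shape, "->", reduced.shape)         # (1000, 100) -> (1000, 5)
```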
Your question is a little vague, but there's an interesting statistical technique that may be what you're thinking of, called Principal Component Analysis, which does something similar (and, incidentally, plotting its results was my first real-world programming task).
It's a neat, clever technique which is remarkably widely applicable. I applied it to similarities between protein amino acid sequences, but I've seen it used to analyse everything from relationships between bacteria to malt whisky.
Consider a graph of some attributes of a collection of things where one has two independent variables; to analyse the relationship between these one obviously plots in two dimensions, and you might see a scatter of points. If you have three variables you can use a 3D graph, but after that one starts to run out of dimensions.
In PCA one might have dozens or even a hundred or more independent factors, all of which need to be plotted on perpendicular axes. Using PCA one does this, then analyses the resultant multidimensional graph to find the set of two or three axes within the graph which contain the largest amount of information. For example, the first Principal Coordinate will be a composite axis (i.e. at some angle through n-dimensional space) which carries the most information when the points are plotted along it. The second axis is perpendicular to this (remember this is n-dimensional space, so there are a lot of perpendiculars) and contains the second largest amount of information, and so on.
Plotting the resultant graph in 2D or 3D will typically give you a visualization of the data which contains a significant amount of the information in the original dataset. It's usual to consider the technique valid if the resulting representation retains around 70% of the original information - enough to visualize relationships with some confidence that would otherwise not be apparent in the raw statistics. Note that the technique requires all factors to have the same weight, but given that, it's an extremely widely applicable method that deserves to be more widely known, and it is available in most statistical packages (I did my work on an ICL 2700 in 1980, which is about as powerful as an iPhone).
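A minimal sketch of that idea with scikit-learn's PCA, on synthetic data driven by 3 hidden factors, just to illustrate projecting onto the two most informative axes and checking how much of the original variation they retain:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 3))                              # 3 hidden factors
data = latent @ rng.standard_normal((3, 40)) + 0.1 * rng.standard_normal((500, 40))

pca = PCA(n_components=2)
coords_2d = pca.fit_transform(data)                                 # 500 x 2, ready to scatter-plot
print(coords_2d.shape, pca.explained_variance_ratio_.sum())
# if the two components retain around 70% or more of the variance, the 2D plot is
# usually considered a fair picture of the data (per the rule of thumb above)
```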
http://en.wikipedia.org/wiki/Dimension_reduction
Maybe you have heard of PCA (principal component analysis), which is a dimension reduction algorithm.
Others include LDA, matrix factorization based methods, etc.
Here's a simple example. You have a lot of text files, and each file contains some words. These files can be classified into two categories. You want to visualize a file as a point in 2D/3D space so that you can see the distribution clearly. So you need to do dimension reduction to transform a file containing a lot of words into only 2 or 3 dimensions.
The dimensionality of a measurement of something is the number of numbers required to describe it. So, for example, the number of numbers needed to describe the location of a point in space is 3 (x, y and z).
Now let's consider the location of a train along a long but winding track through the mountains. At first glance this may appear to be a 3-dimensional problem, requiring longitude, latitude and height measurements to specify. But these 3 dimensions can be reduced to one if you just take the distance travelled along the track from the start instead.
If you were given the task of using a neural network or some statistical technique to predict how far a train could get given a certain quantity of fuel, then it will be far easier to work with the 1 dimensional data than the 3 dimensional version.
It's a technique of data mining. Its main benefit is that it allows you to produce a visual representation of many-dimensional data. The human brain is peerless at spotting and analyzing patterns in visual data, but can process a maximum of three dimensions (four if you use time, i.e. animated displays), so any data with more than 3 dimensions needs to be somehow compressed down to 3 (or 2, since plotting data in 3D can often be technically difficult).
BTW, a very simple form of dimensionality reduction is the use of color to represent an additional dimension, for example in heat maps.
Suppose you're building a database of information about a large collection of adult human beings. It's also going to be quite detailed. So we could say that the database is going to have large dimensions.
As a matter of fact, each database record will include a measure of the person's IQ and shoe size. Now let's pretend that these two characteristics are quite highly correlated. Compared to IQ, shoe size may be easy to measure, and we want to populate the database with useful data as quickly as possible. One thing we could do would be to forge ahead and record shoe sizes for new database records, postponing the task of collecting IQ data for later. We would still be able to estimate IQs using shoe sizes because the two measures are correlated.
We would be using a very simple form of practical dimension reduction by leaving IQ out of records initially. Principal components analysis, various forms of factor analysis and other methods are extensions of this simple idea.