VLFeat: computation of number of octaves for SIFT - image-processing

I am trying to go through and understand some of the VLFeat code to see how they generate the SIFT feature points. One thing that has me baffled early on is how they compute the number of octaves in their SIFT computation.
According to the documentation, if one provides a negative value for the initial number of octaves, it will compute the maximum, which is given by log2(min(width, height)). The corresponding bit of code is:
if (noctaves < 0) {
noctaves = VL_MAX (floor (log2 (VL_MIN(width, height))) - o_min - 3, 1) ;
}
This code is in the vl_sift_new function. Here o_min is supposed to be the index of the first octave (I guess one does not need to start with the full-resolution image). I am assuming this can be set to 0 in most use cases.
Still, I do not understand why they subtract 3 from this value. This seems very confusing. I am sure there is a good reason, but I have not been able to figure it out.

The reason they subtract 3 is to ensure a minimum size for the patch you're looking at, so that you still get some appreciable output. When analyzing patches and extracting features, depending on the algorithm you're using, there is a minimum patch size that the feature detection needs in order to give a good output, and subtracting 3 ensures that this minimum patch size is met once you get to the smallest octave.
Let's take a numerical example. Say we have a 64 x 64 patch. We know that at each octave, each dimension is divided by 2. Therefore, taking the log2 of the smaller of the rows and columns tells you how many times the image can be halved, as you have noticed in the above code. In our case the rows and columns are equal, and log2(64) = 6 halvings, which gives 7 octave levels once you count the original resolution. The octaves are arranged like so:
Octave | Size
--------------------
1 | 64 x 64
2 | 32 x 32
3 | 16 x 16
4 | 8 x 8
5 | 4 x 4
6 | 2 x 2
7 | 1 x 1
However, looking at octaves 5, 6 and 7 will probably not give you anything useful, so there's actually no point in analyzing those octaves. Therefore, by subtracting 3 from the total number of octave levels, we stop analyzing at octave 4, and the smallest patch to analyze is 8 x 8.
As such, this subtraction is commonly performed when looking at scale spaces in images because it enforces that the last octave is of a good size for analyzing features. The number 3 is arbitrary. I've seen people subtract 4 and even 5. From all of the feature detection code that I have seen, 3 seems to be the most widely used number. So, with what I said, it wouldn't really make much sense to look at an octave whose size is 1 x 1, right?
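To make the arithmetic concrete, here is a small sketch in plain Python (not VLFeat code, just the same expression with o_min = 0) that prints the default octave count and the smaller side of the last octave for a few image sizes:

from math import floor, log2

def default_noctaves(width, height, o_min=0):
    # Same expression as the VLFeat snippet above:
    # noctaves = max(floor(log2(min(width, height))) - o_min - 3, 1)
    return max(floor(log2(min(width, height))) - o_min - 3, 1)

for w, h in [(64, 64), (640, 480), (1920, 1080)]:
    n = default_noctaves(w, h)
    smallest_side = min(w, h) >> (n - 1)   # smaller side of the last octave analyzed
    print((w, h), n, smallest_side)

For a 640 x 480 image this gives 5 octaves, and the last octave analyzed is still 40 x 30 rather than a handful of pixels.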

Related

What happens to Huffman in this case (compressing image)

I was wondering what happens to Huffman coding when the pixels are similar; basically Huffman uses the probability of each symbol and works through it.
What happens if the image is like this:
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
etc.
Does Huffman coding fail here?
No, it doesn't fail at all. If your images were stored with eight bits per pixel, they will now be stored with, on average, 2.67 bits per pixel, compressed by a factor of three.
While the symbols all have equal probability, there are only six of them. That permits Huffman coding to use fewer bits per symbol: two of the six symbols get 2-bit codes and the other four get 3-bit codes, for an average of 16/6 ≈ 2.67 bits.
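Here is a rough Python sketch (standard library only, synthetic pixel data) that builds a Huffman code for the six equally likely symbols and checks the 2.67 bits-per-pixel figure:

import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    # Each heap entry is (total weight, tie breaker, {symbol: code length so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**a, **b}.items()}  # deepen both subtrees
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

# The image repeats the row 1 2 3 4 5 6, so all six symbols are equally likely.
pixels = [1, 2, 3, 4, 5, 6] * 100
freqs = Counter(pixels)
lengths = huffman_code_lengths(freqs)
avg_bits = sum(freqs[s] * lengths[s] for s in freqs) / len(pixels)
print(lengths)    # two symbols get 2-bit codes, four get 3-bit codes
print(avg_bits)   # ~2.67 bits per pixel instead of 8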

Complete linkage - what to do when there is a tie

I am considering implementing a complete linkage clustering algorithm from scratch for study purposes. I've seen that there is a big difference compared to single linkage:
Unlike single linkage, the complete linkage method can be strongly affected by draw cases (where there are 2 groups/clusters with the same distance value in the distance matrix).
I'd like to see an example of distance matrix where this occurs and understand why it happens.
Consider the 1-dimensional data set
1 2 3 4 5 6 7 8 9 10
Depending on how you do the first merges, you can get pretty good or pretty bad results. For example, first merge 2-3, 5-6 and 8-9, then 2-3-4 and 7-8-9. Compare this to the "obvious" result that most humans would produce. Every adjacent pair here is at distance 1, so the distance matrix is full of ties, and which tied pair you merge first changes the later complete-linkage distances.
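A rough sketch of this example (assuming SciPy is available): because of all the ties, permuting the input order can change which tied pair gets merged first and, with it, the final partition.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.arange(1, 11, dtype=float)   # the 1-D data set 1..10

for order in (np.arange(10), np.random.default_rng(0).permutation(10)):
    X = points[order].reshape(-1, 1)
    Z = linkage(X, method='complete')                 # complete (maximum) linkage
    labels = fcluster(Z, t=3, criterion='maxclust')   # cut into 3 clusters
    clusters = {}
    for value, lab in zip(points[order], labels):
        clusters.setdefault(lab, []).append(value)
    print(sorted(sorted(c) for c in clusters.values()))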

Anomaly detection with machine learning without labels

I am tracing multiple signals for a certain period of time and associating them with a timestamp, like the following:
t0 1 10 2 0 1 0 ...
t1 1 10 2 0 1 0 ...
t2 3 0 9 7 1 1 ... // pressed a button to change the mode
t3 3 0 9 7 1 1 ...
t4 3 0 8 7 1 1 ... // pressed button to adjust a certain characteristic like temperature (signal 3)
where t0 is the time stamp, 1 is the value for signal 1, 10 is the value for signal 2, and so on.
The data captured during that period of time should be considered the normal case. Now significant deviations from the normal case should be detected. By a significant deviation I do NOT mean that one signal value just changes to a value that has not been seen during the tracing phase, but rather that a lot of values change in a combination that has not been seen before. I do not want to hardcode rules, since in the future more signals might be added or removed, and other "modes" with other signal values might be implemented.
If a small deviation occurs, I want the algorithm to first treat it as a minor change to the training set, and if it occurs multiple times in the future it should be "learned". The major goal is to detect the bigger changes / anomalies.
I hope I explained my problem in enough detail. Thanks in advance.
You could just calculate the nearest neighbor in your feature space and set a threshold for how far it is allowed to be from your test point before that point counts as an anomaly.
Let's say you have 100 values in your "certain period of time",
so you use a 100-dimensional feature space with your training data (which doesn't contain anomalies).
If you get a new data set you want to test, you find the (k) nearest neighbor(s) and compute the (e.g. Euclidean) distance in your feature space.
If that distance is larger than a certain threshold, it's an anomaly.
What you have to do to optimize this is find a good k and a good threshold, e.g. by grid search.
(1) Note that something like this probably only works well if your data has a fixed starting and ending point. Otherwise you would need a huge amount of data, and even then it will not perform as well.
(2) It may also be worth trying to create a separate detector for every "mode" you mentioned in your question.
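A rough illustration of the nearest-neighbor idea (assuming scikit-learn, with each observation already flattened into one fixed-length feature vector; the threshold value here is purely illustrative):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Training data: each row is one "normal" observation (here: one time stamp's signals).
X_train = np.array([
    [1, 10, 2, 0, 1, 0],
    [1, 10, 2, 0, 1, 0],
    [3,  0, 9, 7, 1, 1],
    [3,  0, 9, 7, 1, 1],
    [3,  0, 8, 7, 1, 1],
], dtype=float)

k = 1
nn = NearestNeighbors(n_neighbors=k).fit(X_train)

def is_anomaly(x, threshold=3.0):
    # Euclidean distance to the k-th nearest training point.
    dist, _ = nn.kneighbors(np.asarray(x, dtype=float).reshape(1, -1))
    return dist[0, -1] > threshold

print(is_anomaly([3, 0, 9, 7, 1, 1]))   # matches the normal data -> False
print(is_anomaly([9, 9, 0, 0, 5, 5]))   # far from anything seen  -> True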

Explanation for Values in Scharr-Filter used in OpenCV (and other places)

The Scharr filter is explained in Scharr's dissertation. However, the values given on page 155 (167 in the PDF) are [47 162 47] / 256. Multiplying this with the derivation filter would yield a kernel proportional to
 -47    0    47
-162    0   162
 -47    0    47
Yet all other references I found use
 -3    0    3
-10    0   10
 -3    0    3
which is roughly the same as the one given by Scharr, scaled by a factor of 32.
Now my guess is that the range can be represented better, but I'm curious if there is an official explanation somewhere.
To get the ball rolling on this question in case no "expert" can be found...
I believe the values [3, 10, 3] ... instead of [47 162 47] / 256 ... are used simply for speed. Recall that this method is competing against the Sobel operator, whose coefficient values are 0 and positive/negative 1's and 2's.
Even though the divisor in the division, 256 or 512, is a power of 2 and the division can be performed by a shift, doing that and multiplying by 47 or 162 is going to take more time. A multiplication by 3, however, can in fact be done on some RISC architectures, like the IBM POWER series, in a single shift-and-add operation, that is 3x = (x << 1) + x. (On these architectures, the shifter and adder are separate units and can operate independently.)
I don't find it surprising that the PhD thesis used the more complicated and probably more precise formula; it needed to prove or demonstrate something, and the author probably wasn't overly concerned with how it would be used and implemented alongside other methods. The purpose in the thesis was probably to achieve "perfect rotational symmetry". Afterwards, when someone decided to implement it, I suspect that person used the approximation and gave up a little of the perfect rotational symmetry to gain speed. That person's goal, as I said, was to have something competitive, trading a small amount of this rotational accuracy for speed.
Since I'm guessing you are willing to do this work as it is for your thesis, my suggestion is to implement the original algorithm and benchmark it against both the OpenCV Scharr and Sobel code.
The other thing to try to get an "official" answer is: "Use the 'source', Luke!" The code is on GitHub, so check it out, see who added the Scharr filter there, and contact that person. I won't put the person's name here, but I will say that the code was added on 2010-05-11.
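As a quick numerical check of the "roughly the same, up to a scale factor" observation in the question, here is a small NumPy sketch that builds both 3 x 3 kernels as outer products of the smoothing vectors with a [-1 0 1] derivative filter and prints their element-wise ratio:

import numpy as np

deriv = np.array([-1, 0, 1], dtype=float)
smooth_thesis = np.array([47, 162, 47], dtype=float) / 256   # dissertation values
smooth_common = np.array([3, 10, 3], dtype=float)            # values used in practice

K_thesis = np.outer(smooth_thesis, deriv)
K_common = np.outer(smooth_common, deriv)

ratio = K_common[K_thesis != 0] / K_thesis[K_thesis != 0]
print(K_thesis)
print(K_common)
print(ratio)   # roughly constant (~16 here, ~32 if the derivative is taken as [-1 0 1] / 2)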

Bag of Visual Words in Opencv

I am using BOW in OpenCV for clustering features of variable size. However, one thing is not clear from the OpenCV documentation, and I am also unable to find the reason for this myself:
assume: dictionary size = 100.
I use SURF to compute the features, and each image has a variable number of descriptors, e.g. 128 x 34, 128 x 63, etc. Now in BOW each of them is clustered and I get a fixed descriptor size of 128 x 100 for an image. I know 100 is the number of cluster centers created using k-means clustering.
But I am confused by this: if an image has 128 x 63 descriptors, then how can it be clustered into 100 clusters, which is impossible using k-means UNLESS I convert the descriptor matrix to 1D? Won't converting to 1D lose the valid 128-dimensional information of the individual keypoints?
I need to know how the descriptor matrix is manipulated to get 100 cluster centers from only 63 features.
Think of it like this.
You have 10 cluster means in total and 6 features for the current image. The first 3 of those features are closest to the 5th mean, and the remaining 3 are closest to the 7th, 8th and 9th means respectively. Then your feature vector will look like [0, 0, 0, 0, 3, 0, 1, 1, 1, 0], or a normalized version of this. It is 10-dimensional, which is equal to the number of cluster means. So you can create a 100000-dimensional vector from 63 features if you want.
But I still think there is something wrong, because after you apply BOW, your features should be 1 x 100, not 128 x 100. Your cluster means are 128 x 1 and you are assigning your 128 x 1 sized features (you have 34 128 x 1 features for the first image, 63 128 x 1 features for the second image, etc.) to those means. So basically you are assigning 34 or 63 features to 100 means, and your result should be 1 x 100.
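Here is a rough NumPy-only sketch (synthetic data, not OpenCV's BOWImgDescriptorExtractor itself) of that assignment step: 63 descriptors of dimension 128 are assigned to the nearest of 100 cluster centers, and the image ends up as a single 1 x 100 histogram:

import numpy as np

rng = np.random.default_rng(0)
centers = rng.normal(size=(100, 128))       # vocabulary: 100 cluster means from k-means
descriptors = rng.normal(size=(63, 128))    # one image's 63 SURF-like descriptors

# Nearest center for every descriptor (Euclidean distance): 63 x 100 distances -> 63 labels.
dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
assignments = dists.argmin(axis=1)

# Histogram over the vocabulary: the final 1 x 100 representation of the image.
bow = np.bincount(assignments, minlength=100).astype(float)
bow /= bow.sum()    # optional normalization
print(bow.shape)    # (100,)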
