What are the values of the SURF descriptor? - opencv

I'm extracting SURF descriptors from an image using the following simple lines:
Ptr<DescriptorExtractor> descriptor = DescriptorExtractor::create("SURF");
descriptor->compute(im1, kp, desc1);
Now, when I "watch" the variable desc1.data, it contains integer values in the range 0 to 255.
However, when I investigate the values using the code:
for (int j = 0; j < desc1.cols; j++) {
    float a = desc1.at<float>(0, j);
}
it contains values between -1 and 1. How is that possible? SURF shouldn't return integer values the way SIFT does, should it?

I am not sure what exactly happens in OpenCV, but as far as the paper goes, this is what SURF does. The SURF descriptor divides a small image patch into 4x4 sub-regions and computes Haar wavelet responses over each sub-region. Each sub-region contributes a 4-tuple < sum(dx), sum(dy), sum(|dx|), sum(|dy|) >, where dx, dy are the wavelet responses in that sub-region. The descriptor is constructed by concatenating all the responses and normalizing to unit length, which results in a 64-dimensional descriptor. It is clear from this description that the normalized sum(dx) and sum(dy) values lie between -1 and 1, while sum(|dx|) and sum(|dy|) lie between 0 and 1. For the 128-dimensional (extended) descriptor, the sums of dx and |dx| are computed separately for dy < 0 and dy >= 0 (and likewise the sums of dy and |dy| are split on the sign of dx), which doubles the size of the 64-dimensional descriptor.
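As a minimal sketch (assuming the nonfree/SURF module is available in your OpenCV build), the following shows why watching desc1.data gives integers 0-255 while at<float> gives the normalized values: the matrix is CV_32F and data is just a byte pointer into those floats.

#include <opencv2/opencv.hpp>
#include <cstdio>

// Minimal sketch: inspect the descriptor matrix produced by compute().
// desc1 is CV_32F; each row is one 64- or 128-dimensional, unit-length descriptor.
void inspectDescriptors(const cv::Mat& desc1)
{
    printf("stored as floats: %s\n", desc1.type() == CV_32F ? "yes" : "no");

    // desc1.data is a uchar* into the raw bytes of those floats, which is why
    // a debugger watching it shows integers in the range 0..255.
    for (int j = 0; j < desc1.cols; j++) {
        float a = desc1.at<float>(0, j);   // normalized value, roughly in [-1, 1]
        printf("%f ", a);
    }
    printf("\n");
}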

Related

How to calculate 512 point FFT using 2048 point FFT hardware module

I have a 2048-point FFT IP. How may I use it to calculate a 512-point FFT?
There are different ways to accomplish this, but the simplest is to replicate the input data 4 times to obtain a signal of 2048 samples. Note that the DFT (which is what the FFT computes) can be seen as assuming the input signal is replicated infinitely. Thus, we are just providing a larger "view" of this infinitely long periodic signal.
The resulting FFT will have 512 non-zero values, with zeros in between. Each of the non-zero values will also be four times as large as the 512-point FFT would have produced, because there are four times as many input samples (that is, if the normalization is as commonly applied, with no normalization in the forward transform and 1/N normalization in the inverse transform).
Here is a proof of principle in MATLAB:
data = randn(1,512);
ft = fft(data); % 512-point FFT
data = repmat(data,1,4);
ft2 = fft(data); % 2048-point FFT
ft2 = ft2(1:4:end) / 4; % 512-point FFT
assert(all(ft2==ft))
(Very surprisingly, the values were exactly equal; no differences due to numerical precision appeared in this case!)
An alternative to the correct solution provided by Cris Luengo, which does not require any rescaling, is to pad the data with zeros to the required length of 2048 samples. You then get your result by reading every 2048/512 = 4th output (i.e. output[0], output[4], output[8], ... in a 0-based indexing system).
Since you mention making use of a hardware module, this could be implemented in hardware by connecting the data to the first 512 input pins, grounding all other inputs, and reading every 4th output pin (ignoring all other output pins).
Note that this works because the FFT of the zero-padded signal is an interpolation in the frequency domain of the original signal's FFT. In this case you do not need the interpolated values, so you can just ignore them. Here's an example computing a 4-point FFT using a 16-point module (I've reduced the size of the FFT for brevity, but kept the same ratio of 4 between the two):
x = [1,2,3,4]
fft(x)
ans> 10.+0.j,
-2.+2.j,
-2.+0.j,
-2.-2.j
x = [1,2,3,4,0,0,0,0,0,0,0,0,0,0,0,0]
fft(x)
ans> 10.+0.j, 6.499-6.582j, -0.414-7.242j, -4.051-2.438j,
-2.+2.j, 1.808+1.804j, 2.414-1.242j, -0.257-2.3395j,
-2.+0.j, -0.257+2.339j, 2.414+1.2426j, 1.808-1.8042j,
-2.-2.j, -4.051+2.438j, -0.414+7.2426j, 6.499+6.5822j
As you can see in the second output, the first column (which corresponds to outputs 0, 4, 8 and 12) is identical to the desired output from the first, smaller-sized FFT.
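If it helps to see this in one runnable place, here is a minimal C++ sketch, using a small naive DFT purely as a stand-in for the hardware FFT module, that checks the zero-pad-and-decimate relationship: every 4th bin of the padded transform equals the small transform.

#include <complex>
#include <cstdio>
#include <vector>
#include <cmath>

// Naive DFT, standing in for the FFT module purely for illustration.
static std::vector<std::complex<double>> dft(const std::vector<std::complex<double>>& x)
{
    const double PI = std::acos(-1.0);
    const size_t N = x.size();
    std::vector<std::complex<double>> X(N);
    for (size_t k = 0; k < N; ++k)
        for (size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::polar(1.0, -2.0 * PI * double(k * n) / double(N));
    return X;
}

int main()
{
    std::vector<std::complex<double>> x = {1.0, 2.0, 3.0, 4.0};  // the 4-point input
    std::vector<std::complex<double>> padded(16);                // zero-padded to 16
    for (size_t i = 0; i < x.size(); ++i) padded[i] = x[i];

    const auto small4  = dft(x);       // the transform we actually want
    const auto large16 = dft(padded);  // the transform the big module computes

    // Every 4th bin of the padded transform equals the 4-point transform.
    for (size_t k = 0; k < small4.size(); ++k)
        printf("X4[%zu] = %6.2f%+6.2fj    X16[%zu] = %6.2f%+6.2fj\n",
               k, small4[k].real(), small4[k].imag(),
               4 * k, large16[4 * k].real(), large16[4 * k].imag());
    return 0;
}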

Computer Vision: Calculate normalized value of descriptors

I am doing an image classification project and I have built a corpus of features.
I want to normalize my features to between -1 and 1 as input for PyBrain. I am using the following formula to normalize the features:
Normalized value = (Value - Mean) / Standard Deviation
but it gives me some normalized values between -3 and 3, which is not within the range I need.
I have 100 inputs and 1 output in PyBrain.
The equation you used is standardization. It does not guarantee your values are in [-1, 1]; it rescales your data to have a mean of 0 and a standard deviation of 1 afterwards. Individual points can still be more than one standard deviation away from the mean.
There are multiple options to bound your data.
Use a nonlinear function such as tanh (very popular in neural networks)
center, then rescale with 1/max(abs(dev))
preserve 0, then rescale with 1/max(abs(dev))
2*(x-min)/(max-min) - 1
standardize (as you did) but truncate values to [-1, +1]
... many more
If your data set contains only positive values, you can normalize them using this formula:
Normalized value = (Value / (0.5 * Max_Value)) - 1
This gives you values in the range [-1, +1].
If you have both positive and negative values:
Normalized value = ((Value - Min_Value) / (Max_Value - Min_Value) - 0.5) * 2
Maybe you can do this:
Mid_value = ( Max_value + Min_Value )/2
Max_difference = ( Max_value - Min_Value )/2;
Normalized_value = ( Value - Mid_value )/Max_difference;
The Normalized_value shall be within [-1,+1].
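As a minimal sketch of the mid-value / half-range formula above (assuming the features live in a plain float vector; the vector is non-empty and not all values are identical):

#include <algorithm>
#include <vector>

// Rescale feature values to [-1, +1] with the mid-value / half-range formula above.
void normalizeToUnitRange(std::vector<float>& values)
{
    const auto bounds = std::minmax_element(values.begin(), values.end());
    const float mid  = (*bounds.second + *bounds.first) / 2.0f;  // Mid_value
    const float half = (*bounds.second - *bounds.first) / 2.0f;  // Max_difference
    for (float& v : values)
        v = (v - mid) / half;                                    // now in [-1, +1]
}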

How does BruteForce Feature Matching compute the "distance" value?

I wrote an application which detects keypoints, computes their descriptors and matches them with BruteForce in OpenCV. That works like a charm.
But:
How is the distance in the match-objects computed?
For example: I'm using SIFT and get a descriptor vector with 128 float values per keypoint.
In matching, the keypoint is compared with, for example, 10 other descriptors of the same vector size.
Now, I get the "best match" with a distance of 0.723.
Is this the average of the element-wise Euclidean distances between the floats of one vector and the other?
I just want to understand how this one value is created.
By default, according to the OpenCV docs, the BFMatcher uses the L2 norm.
C++: BFMatcher::BFMatcher(int normType=NORM_L2, bool crossCheck=false )
Parameters:
normType – One of NORM_L1, NORM_L2, NORM_HAMMING, NORM_HAMMING2.
L1 and L2 norms are preferable choices for SIFT and SURF descriptors ...
See: http://docs.opencv.org/modules/features2d/doc/common_interfaces_of_descriptor_matchers.html?highlight=bruteforcematcher#bruteforcematcher
The best match is the feature vector with the lowest distance compared to all the others.
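To make that concrete, here is a minimal sketch of what the L2 distance between two descriptor rows amounts to: the single value you see in a DMatch is this one number, not an average of per-element distances. The same result can also be obtained with cv::norm(a, b, cv::NORM_L2).

#include <opencv2/core/core.hpp>
#include <cmath>

// L2 (Euclidean) distance between two descriptor rows: the square root of the
// sum of squared element-wise differences.
float l2Distance(const cv::Mat& a, const cv::Mat& b)
{
    CV_Assert(a.type() == CV_32F && b.type() == CV_32F && a.cols == b.cols);
    double sum = 0.0;
    for (int j = 0; j < a.cols; ++j) {
        const double d = a.at<float>(0, j) - b.at<float>(0, j);
        sum += d * d;
    }
    return (float)std::sqrt(sum);  // same value as cv::norm(a, b, cv::NORM_L2)
}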

FFT with iOS vDSP not symmetrical

I'm using Apples vDSP APIs to calculate the FFT of audio. However, my results (in amp[]) aren't symmetrical around N/2, which they should be, from my understanding of FFTs on real inputs?
In the code below, frame is an array of 128 floats containing the audio samples.
int numSamples = 128;
vDSP_Length log2n = log2f(numSamples);
FFTSetup fftSetup = vDSP_create_fftsetup(log2n, FFT_RADIX2);
int nOver2 = numSamples/2;
COMPLEX_SPLIT A;
A.realp = (float *) malloc(nOver2*sizeof(float));
A.imagp = (float *) malloc(nOver2*sizeof(float));
vDSP_ctoz((COMPLEX*)frame, 2, &A, 1, nOver2);
//Perform FFT using fftSetup and A
//Results are returned in A
vDSP_fft_zrip(fftSetup, &A, 1, log2n, FFT_FORWARD);
//Convert COMPLEX_SPLIT A result to float array to be returned
float amp[numSamples];
amp[0] = A.realp[0]/(numSamples*2);
for(int i=1;i<numSamples;i++) {
amp[i]=A.realp[i]*A.realp[i]+A.imagp[i]*A.imagp[i];
printf("%f ",amp[i]);
}
If I put the same float array into an online FFT calculator I do get a symmetrical output. Am I doing something wrong above?
For some reason, most values in amp[] are around 0 to 1e-5, but I also get one huge value of about 1e23. I'm not doing any windowing here, just trying to get a basic FFT working initially.
I've attached a picture of the two FFT outputs, using the same data. You can see they are similar up to 64, although not by a constant scaling factor, so I'm not sure how they differ. Above 64 they are completely different.
Because the mathematical output of a real-to-complex FFT is symmetrical, there is no value in returning the second half. There is also no space for it in the array that is passed to vDSP_fft_zrip. So vDSP_fft_zrip returns only the first half (except for the special N/2 point, discussed below). The second half is usually not needed explicitly and, if it is, you can compute it easily from the first half.
The output of vDSP_fft_zrip when used for a forward (real-to-complex) transform has the H(0) output (which is purely real; its imaginary part is zero) in A.realp[0]. The H(N/2) output (which is also purely real) is stored in A.imagp[0]. The remaining values H(i), for 0 < i < N/2, are stored normally in A.realp[i] and A.imagp[i].
Documentation explaining this is here, in the section “Data Packing for Real FFTs”.
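As a minimal sketch (reusing numSamples, nOver2 and A exactly as set up in the question's code), unpacking that layout and mirroring it gives the full, symmetric power spectrum:

// Unpack the vDSP_fft_zrip result and mirror it via conjugate symmetry.
float power[128];                            // numSamples bins

// H(0) and H(N/2) are both purely real and share the first complex slot:
power[0]      = A.realp[0] * A.realp[0];
power[nOver2] = A.imagp[0] * A.imagp[0];

// H(i) for 0 < i < N/2 is stored normally in realp[i] / imagp[i].
for (int i = 1; i < nOver2; i++) {
    power[i] = A.realp[i] * A.realp[i] + A.imagp[i] * A.imagp[i];
    power[numSamples - i] = power[i];        // conjugate symmetry: |H(N-i)| = |H(i)|
}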
To get symmetric results from strictly real input to a basic FFT, your complex data input and output arrays have to be the same length as your FFT. You seem to be allocating and copying only half your data into the FFT input, which could be feeding memory garbage to the FFT.

How does opencv store matrix value in Gaussian mixture? In which order?

I've searched the "bgfg_gaussmix2.cpp" code. In the Gaussian mixture model it stores the mixture weight (w), the mean (nchannels values) and the covariance for each Gaussian component of each pixel's background model. I want to know the order of storage: for instance, is it "weight, mean, covariance", or "mean, covariance, weight", or something else?
Thanks in advance.
If you are speaking about the Gaussian mixture structure CvPBGMMGaussian, the storage order is:
Weight
mean dimension 1
mean dimension 2
mean dimension 3
Variance
The three mean dimensions are packed in a float array.
Here is the definition of this structure:
#define CV_BGFG_MOG2_NDMAX 3
typedef struct CvPBGMMGaussian
{
    float weight;
    float mean[CV_BGFG_MOG2_NDMAX];
    float variance;
} CvPBGMMGaussian;
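Purely as an illustration of that packing, reading one Gaussian back from a raw float buffer looks like this (the numbers are hypothetical, not taken from OpenCV):

#include <cstdio>

// Illustration only: one Gaussian packed as weight, mean[3], variance.
int main()
{
    const float buffer[5] = {0.6f, 128.0f, 64.0f, 32.0f, 15.0f}; // hypothetical values
    const float weight    = buffer[0];
    const float *mean     = &buffer[1];       // one value per channel, up to 3
    const float variance  = buffer[4];
    printf("w=%.2f  mean=(%.0f, %.0f, %.0f)  var=%.1f\n",
           weight, mean[0], mean[1], mean[2], variance);
    return 0;
}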
If you are not speaking about this structure, please be more precise in your question.
