OpenCV kmeans doesn't assign any data to some clusters

I'm trying to implement Scalable Recognition with a Vocabulary Tree, and I'm using the OpenCV kmeans function to cluster feature vectors. I put all my vectors in one Mat object and pass it to the function like this:
TermCriteria criteria;
criteria.epsilon = 0.1;
int attempts = 1;
int flags = KMEANS_RANDOM_CENTERS;
int K = 10;
Mat Centers;
Mat Labels;
kmeans(descriptors, K, Labels, criteria, attempts, flags, Centers);
The function then fills the "Centers" and "Labels" Mat objects like this:
Centers has K rows, 64 columns (I'm using SURF features) and one channel.
Labels has as many rows as "descriptors", one column and one channel, and its values are in the range [0, K-1].
These are the things I have checked. After clustering all the vectors this way, I copy the vectors with the same label to a new Mat and pass it to the function again.
My problem is that sometimes one of the values in the range [0, K-1] is missing from "Labels", so none of the feature vectors is assigned to that cluster. I've checked it for different values of K, and it usually happens at least once at some level (never in the first call, though), even for K = 3.
I assume that at those times the data I pass to the function is not right. So my question is: when could this happen? What should I check on the data that I pass to the function to make sure it is valid?
Also, if you have a link to any good implementation of the paper, I would really appreciate it if you posted it here.

It turned out that sometimes a cluster has fewer than K members, so at the next level the function returns an error. I still haven't figured out why a cluster is sometimes empty, though.
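A check along these lines can be run on the labels before recursing into the next level of the tree. This is only a minimal sketch in Python with NumPy rather than the C++ API used above; the helper names `cluster_sizes` and `can_split_further` and the sample labels are made up for illustration.

```python
import numpy as np

def cluster_sizes(labels, k):
    """Count how many samples were assigned to each of the k clusters."""
    labels = np.asarray(labels).ravel()  # accepts an N x 1 label column as well
    return np.bincount(labels, minlength=k)

def can_split_further(labels, k):
    """True for every cluster that holds at least k members,
    i.e. enough samples to be clustered into k groups again."""
    return cluster_sizes(labels, k) >= k

# Hypothetical labels for 9 descriptors and K = 3; cluster 1 received nothing.
labels = np.array([0, 0, 2, 2, 2, 0, 2, 0, 2])
sizes = cluster_sizes(labels, 3)
splittable = can_split_further(labels, 3)
print(sizes, splittable)
```

Clusters flagged as not splittable could then be kept as leaves instead of being passed to kmeans again.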

Related

Trying to do PCA analysis on interest rate swaps data (multivariate time series)

I have a data set with 20 non-overlapping different swap rates (spot1y, 1y1y, 2y1y, 3y1y, 4y1y, 5y2y, 7y3y, 10y2y, 12y3y...) over the past year.
I want to use PCA / multiregression and look at residuals in order to determine which sectors on the curve are cheap/rich. Has anyone had experience with this? I've done PCA but not for time series. I'd ideally like to model something similar to the first figure here but in USD.
https://plus.credit-suisse.com/rpc4/ravDocView?docid=kv66a7
Thanks!
Here are some broad strokes that can help answer your question. Also, that's a neat analysis from CS :)
Let's be pythonistas and use NumPy. You can imagine your dataset as a 20x261 array of floats. The first place to start is creating the array. Suppose you have a CSV file storing the raw data persistently. Then a reasonable first step to load the data would be something as simple as:
import numpy
x = numpy.loadtxt("path/to/my/file", delimiter=",")  # comma-separated values; adjust the delimiter if your file differs
The object x is our raw time series matrix, and we verify that x.shape == (20, 261) holds. The next step is to transform this array into its covariance matrix. Whether or not it has already been done on the raw data, the first step is centering each time series on its mean, like this:
x_centered = x - x.mean(axis=1, keepdims=True)
The purpose of this step is to help simplify any necessary rescaling, and it is a very good habit that usually shouldn't be skipped. The call to x.mean uses the parameters axis and keepdims to make sure each row (e.g. the time series for spot1y, ...) is centered on its mean value.
The next steps are to square and scale x_centered to produce a swap rate covariance array. With 2-dimensional arrays like this one, there are two ways to square it: one that leads to a 261x261 array and another that leads to a 20x20 array. It's the second array we are interested in, and the squaring procedure that will work for our purposes is:
x_centered_squared = numpy.matmul(x_centered, x_centered.transpose())
Then, to scale, one can choose between 1/261 and 1/(261-1) depending on the statistical context, which looks like this:
x_covariance = x_centered_squared * (1/261)
The array x_covariance has an entry for how each swap rate changes with itself, and changes with any one of the other swap rates. In linear-algebraic terms, it is a symmetric operator that characterizes the spread of each swap rate.
Linear algebra also tells us that this array can be decomposed into its associated eigen-spectrum, with elements in this spectrum being scalar-vector pairs, or eigenvalue-eigenvector pairs. In the analysis you shared, x_covariance's eigenvalues are plotted in exhibit two as percent variance explained. To produce the data for a plot like exhibit two (which you will always want to furnish to the readers of your PCA), you simply divide each eigenvalue by the sum of all of them, then multiply each by 100.0. Because x_covariance is symmetric, a suitable way to compute its spectrum is like this:
vals, vects = numpy.linalg.eigh(x_covariance)  # eigh is meant for symmetric matrices; eigenvalues come back in ascending order
vals, vects = vals[::-1], vects[:, ::-1]  # largest eigenvalue first; the eigenvectors are the columns of vects
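The percent-variance-explained figures described above can be sketched as follows. The random array is only a synthetic stand-in for the real swap rate data, and the name `percent_explained` is an assumption made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 261))  # synthetic stand-in for the 20 x 261 swap rate matrix

# Center each series and form the 20 x 20 covariance matrix.
x_centered = x - x.mean(axis=1, keepdims=True)
x_covariance = x_centered @ x_centered.T / x.shape[1]

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order,
# so reverse them to list the largest component first.
vals = np.linalg.eigvalsh(x_covariance)[::-1]

# Share of the total variance carried by each principal component, in percent.
percent_explained = 100.0 * vals / vals.sum()
print(percent_explained[:3])
```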
We are now in a position to talk about residuals! Here is their definition (with our namespace): residuals_ij = x_ij − reconstructed_ij; i = 1:20; j = 1:261. Thus for every datum in x, there is a corresponding residual, and to find them, we need to recover the reconstructed_ij array. We can do this column-by-column, operating on each x_i with a change of basis operator to produce each reconstructed_i, each of which can be viewed as coordinates in a proper subspace of the original or raw basis. The analysis describes a modified Gram-Schmidt approach to compute the change of basis operator we need, which ensures this proper subspace's basis is an orthogonal set.
What we are going to do in this approach is take the eigenvectors corresponding to the three largest eigenvalues and transform them into three mutually orthogonal vectors u, v, w (named this way so that x keeps referring to the raw data). Eigenvectors of a symmetric matrix that belong to distinct eigenvalues are already orthogonal, so this step mostly guards against numerical drift. Search the web for discussions of the Gram-Schmidt process in practical applications, but for simplicity let's follow the analysis by hand:
top = vects[:, numpy.argsort(vals)[::-1][:3]]  # columns: eigenvectors of the three largest eigenvalues
u = top[:, 0]
uu = numpy.dot(u, u)
v = top[:, 1] - (numpy.dot(u, top[:, 1]) / uu) * u
vv = numpy.dot(v, v)
w = top[:, 2] - (numpy.dot(u, top[:, 2]) / uu) * u - (numpy.dot(v, top[:, 2]) / vv) * v
It's reasonable to normalize before or after this step; the reconstruction further down assumes unit-length basis vectors, so we normalize when building the change of basis.
With the raw data we implicitly assumed that the basis is standard, so we need a map between {e1, e2, ..., e20} and {u, v, w}, which is given by
ch_of_basis = numpy.array([u / numpy.linalg.norm(u), v / numpy.linalg.norm(v), w / numpy.linalg.norm(w)]).transpose()  # shape (20, 3)
This can be used to compute each reconstructed_i by projecting a measurement onto the new basis and mapping it back, like this:
reconstructed = []
for measurement in x.transpose():
    coords = numpy.dot(ch_of_basis.transpose(), measurement)  # 3 coordinates in the {u, v, w} basis
    reconstructed.append(numpy.dot(ch_of_basis, coords))  # back to the 20 original coordinates
reconstructed = numpy.array(reconstructed).transpose()
And then you get the residuals by subtraction:
residuals = x - reconstructed
This flow obviously might need further tuning, but it's the gist of how to compute all the residuals. To get that periodic bar plot, take the average of each row in residuals.
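For reference, the whole flow can be condensed into a vectorized sketch. It assumes that the three leading eigenvectors of the covariance matrix serve directly as the orthonormal basis. The function name `pca_residuals` and the synthetic three-factor data are made up for this example and are not taken from the analysis.

```python
import numpy as np

def pca_residuals(x, n_components=3):
    """Residuals of x (series in rows) after projecting every column onto
    the leading eigenvectors of the covariance matrix of the rows."""
    x_centered = x - x.mean(axis=1, keepdims=True)
    cov = x_centered @ x_centered.T / x.shape[1]
    vals, vects = np.linalg.eigh(cov)             # ascending eigenvalues
    basis = vects[:, ::-1][:, :n_components]      # (n_series, n_components), orthonormal columns
    reconstructed = basis @ (basis.T @ x)         # project, then map back
    return x - reconstructed, basis

# Synthetic data: 20 series driven by 3 common factors plus small noise.
rng = np.random.default_rng(1)
factors = rng.normal(size=(3, 261))
loadings = rng.normal(size=(20, 3))
x = loadings @ factors + 0.01 * rng.normal(size=(20, 261))

residuals, basis = pca_residuals(x)
row_averages = residuals.mean(axis=1)  # one bar per swap rate
```

By construction the residuals have no component left along the retained basis vectors, which is a handy sanity check on real data.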

Batch Normalization in Convolutional Neural Network

I am a newbie to convolutional neural networks and only have a basic idea of feature maps and how convolution is applied to images to extract features. I would be glad to learn some details about applying batch normalisation in a CNN.
I read this paper https://arxiv.org/pdf/1502.03167v3.pdf and could understand the BN algorithm as applied to data, but at the end they mention that a slight modification is required when it is applied to a CNN:
For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations – so for a mini-batch of size m and feature maps of size p × q, we use the effective mini-batch of size m′ = |B| = m · pq. We learn a pair of parameters γ(k) and β(k) per feature map, rather than per activation. Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
I am totally confused when they say
"so that different elements of the same feature map, at different locations, are normalized in the same way"
I know what feature maps mean, and that the different elements are the weights in every feature map. But I could not understand what location or spatial location means.
I could not understand the sentence below at all:
"In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations"
I would be glad if someone could elaborate and explain it to me in much simpler terms.
Let's start with the terms. Remember that the output of the convolutional layer is a rank-4 tensor [B, H, W, C], where B is the batch size, (H, W) is the feature map size, and C is the number of channels. An index (x, y) where 0 <= x < H and 0 <= y < W is a spatial location.
Usual batchnorm
Now, here's how the batchnorm is applied in a usual way (in pseudo-code):
# t is the incoming tensor of shape [B, H, W, C]
# mean and stddev are computed along 0 axis and have shape [H, W, C]
mean = mean(t, axis=0)
stddev = stddev(t, axis=0)
for i in 0..B-1:
    out[i,:,:,:] = norm(t[i,:,:,:], mean, stddev)
Basically, it computes H*W*C means and H*W*C standard deviations across B elements. You may notice that different elements at different spatial locations have their own mean and variance and gather only B values.
Batchnorm in conv layer
This way is totally possible. But the convolutional layer has a special property: filter weights are shared across the input image (you can read it in detail in this post). That's why it's reasonable to normalize the output in the same way, so that each output value takes the mean and variance of B*H*W values, at different locations.
Here's what the code looks like in this case (again pseudo-code):
# t is still the incoming tensor of shape [B, H, W, C]
# but mean and stddev are computed along (0, 1, 2) axes and have just [C] shape
mean = mean(t, axis=(0, 1, 2))
stddev = stddev(t, axis=(0, 1, 2))
for i in 0..B-1, x in 0..H-1, y in 0..W-1:
    out[i,x,y,:] = norm(t[i,x,y,:], mean, stddev)
In total, there are only C means and standard deviations and each one of them is computed over B*H*W values. That's what they mean when they say "effective mini-batch": the difference between the two is only in axis selection (or equivalently "mini-batch selection").
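The two pseudo-code variants can be turned into a runnable NumPy sketch. This is an illustration only: the function names are made up, the learned scale and shift (γ and β) are left out, and `eps` is an assumed small constant for numerical stability.

```python
import numpy as np

def batchnorm_per_activation(t, eps=1e-5):
    """'Usual' batchnorm: statistics over the batch axis only,
    giving H*W*C means and variances."""
    mean = t.mean(axis=0, keepdims=True)
    var = t.var(axis=0, keepdims=True)
    return (t - mean) / np.sqrt(var + eps)

def batchnorm_per_channel(t, eps=1e-5):
    """Conv-layer batchnorm: statistics over batch and spatial axes,
    giving only C means and variances (one per feature map)."""
    mean = t.mean(axis=(0, 1, 2), keepdims=True)
    var = t.var(axis=(0, 1, 2), keepdims=True)
    return (t - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
t = rng.normal(loc=3.0, scale=2.0, size=(8, 5, 5, 4))  # [B, H, W, C]

dense_style = batchnorm_per_activation(t)
conv_style = batchnorm_per_channel(t)
print(conv_style.mean(axis=(0, 1, 2)), conv_style.std(axis=(0, 1, 2)))
```

Here each per-channel statistic is estimated from B*H*W = 200 values, whereas each per-activation statistic only sees B = 8 values.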
Some clarification on Maxim's answer.
I was puzzled to see that in Keras the axis you specify is the channels axis, as it doesn't make sense to normalize over the channels, since every channel in a conv-net is considered a different "feature". I.e., normalizing over all channels is equivalent to normalizing the number of bedrooms together with the size in square feet (the multivariate regression example from Andrew's ML course). This is usually not what you want; what you do is normalize every feature by itself. I.e., you normalize the number of bedrooms across all examples to have mu=0 and std=1, and you normalize the square feet across all examples to have mu=0 and std=1.
This is why you want C means and stds, because you want a mean and std per channel/feature.
After checking and testing it myself I realized the issue: there's a bit of a confusion/misconception here. The axis you specify in Keras is actually the axis that is left out of the calculations, i.e. you average over every axis except the one specified by this argument. This is confusing, as it is exactly the opposite of how NumPy works, where the specified axis is the one you do the operation on (e.g. np.mean, np.std, etc.).
I actually built a toy model with only BN, and then calculated the BN manually - took the mean, std across all the 3 first dimensions [m, n_W, n_H] and got n_C results, calculated (X-mu)/std (using broadcasting) and got identical results to the Keras results.
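The manual check described above can be sketched in NumPy as follows. The helper `normalize_keeping_axis` is a made-up name that mimics the "every axis except the specified one" convention. It is not Keras code, and `eps` is an assumed stabilising constant.

```python
import numpy as np

def normalize_keeping_axis(t, axis=-1, eps=1e-5):
    """Normalize with statistics taken over every axis EXCEPT `axis`,
    so one mean/std pair is produced per index along `axis`."""
    keep = axis % t.ndim
    reduce_axes = tuple(a for a in range(t.ndim) if a != keep)
    mean = t.mean(axis=reduce_axes, keepdims=True)
    var = t.var(axis=reduce_axes, keepdims=True)
    return (t - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3, 3, 2))  # [m, n_W, n_H, n_C]

# Manual version: reduce over the first three axes, broadcast over channels.
mu = X.mean(axis=(0, 1, 2))    # n_C values
var = X.var(axis=(0, 1, 2))    # n_C values
manual = (X - mu) / np.sqrt(var + 1e-5)

same = np.allclose(normalize_keeping_axis(X, axis=-1), manual)
print(same)
```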
Hope this helps anyone who was confused as I was.
I'm only 70% sure of what I say, so if it does not make sense, please edit or mention it before downvoting.
About location or spatial location: they mean the position of pixels in an image or feature map. A feature map is comparable to a sparse, modified version of the image in which concepts are represented.
About "so that different elements of the same feature map, at different locations, are normalized in the same way":
some normalisation algorithms are local, so they depend on their close surroundings (location) and not on things far apart in the image. They probably mean that every pixel, regardless of its location, is treated just like an element of a set, independently of its direct spatial surroundings.
About "In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations": they get a flat list of every value of every training example in the mini-batch, and this list combines values regardless of their location on the feature map.
First, we need to make it clear that the depth of a kernel is determined by the number of channels of the previous feature map, and the number of kernels in this layer determines the number of channels of the next feature map (the next layer).
Second, each kernel (usually three-dimensional) generates just one channel of the feature map in the next layer.
Third, we should accept the idea that all points in a generated feature map (regardless of their position) are generated by the same kernel sliding over the previous layer. So they can be seen as a distribution generated by this kernel, i.e. as samples of a random variable, and they should be averaged together to obtain the mean and then the variance. (This is not rigorous; it only helps understanding.)
This is what they mean by "so that different elements of the same feature map, at different locations, are normalized in the same way".

hierarchical clustering using flann in opencv

I'm trying to use the hierarchicalClustering method from OpenCV 2.4.2.
It works without error, but the problem is that I don't understand the parameters it accepts, e.g. branching.
I think this causes my problem: I always get just one cluster.
My input is a cv::Mat of LBPH features (for face detection); the number of rows is 12 and the number of cols is 6272.
No matter what the value of the branching factor is, I always get just one cluster, and its centroid is the mean of the rows of the input matrix groupped_one_person_features.
Could you advise?
Thanks a lot!
Here's the code:
cv::Mat groupped_one_person_features;
.... // fill groupped_one_person_features with data
int Nclusters=50;
cv::Mat centroids (Nclusters,Features.data[0][0].cols,CV_32FC1);
int count = cv::flann::hierarchicalClustering<cvflann::L1<float>>groupped_one_person_features,centroids,cvflann::KMeansIndexParams(2000,11,cvflann::FLANN_CENTERS_KMEANSPP));
First of all, you missed a parenthesis in your last line:
int count = cv::flann::hierarchicalClustering<cvflann::L1<float>>(groupped_one_person_features,centroids,cvflann::KMeansIndexParams(2000,11,cvflann::FLANN_CENTERS_KMEANSPP));
In order, the parameters are (according to flann_base.hpp):
The points to be clustered
The computed cluster centers. Matrix should be preallocated and centers.rows is the number of clusters requested.
The clustering parameters
The distance to be used for clustering
Therefore, if you always get one cluster, it possibly means that your centroids matrix only has one row. Can you verify this?
The parameters of KMeansIndexParams are (according to kmeans_index.h):
branching factor: the number of children of a node in the tree
iterations: max iterations to perform in one kmeans clustering (kmeans tree)
centers_init: algorithm used for picking the initial cluster centers for kmeans tree
cb_index: cluster boundary index. Used when searching the kmeans tree

OpenCV 2.4.3 PCA class - when number of samples is less than number of dimensions

I'm trying to use the PCA class in OpenCV to perform principal component analysis in my C++ application. I'm new to OpenCV and I'm having a problem, so I hope someone can help.
I'm running a demo example in both Matlab and the PCA class to compare the answers.
When I use a 2*10 data array and the parameter CV_PCA_DATA_AS_COL, I have two dimensions, so I expect 2 eigenvectors with 2 elements each, and this works as expected, with the same results as Matlab.
But when using a 10*2 data array (generally, when the number of samples is less than the number of dimensions), I get a 2*10 array of eigenvectors, i.e. 10 eigenvectors with 2 elements each. This is not expected and it's not the result given by Matlab (Matlab gives a 10*10 matrix of eigenvectors).
I don't know why I'm getting those results, and because of this I can't project the data onto the principal components in my application. Any help?
P.S.: The code I used:
Mat Mean ;
Mat H(10, 2, CV_32F); // then the matrix is filled by data
PCA pca(H,Mean,CV_PCA_DATA_AS_COL,0) ;
pca.operator()(H,Mean,CV_PCA_DATA_AS_COL,0) ;
cout << pca.eigenvectors.rows << endl; // gives 2 instead of 10
cout << pca.eigenvectors.cols << endl; // gives 10
I'd state it as follows:
If the number of samples is less than the data dimension then the number of retained components will be clamped at the number of samples.
We did 3x3 PCA for a mechanics subject at uni, and some non-linear control algorithms used similar approaches. My memory is foggy, but it may have something to do with assumptions regarding pseudo-inverses and non-square matrices...
Once you delve into the theory - websearch 'pca with less samples than dimensions' - it gets messy fast!
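A small NumPy experiment illustrates why only a handful of components can be meaningful when there are fewer samples than dimensions. It is a sketch of the underlying linear algebra on random data, not of what the OpenCV PCA class does internally, and all variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(10, 2))  # 10 dimensions (rows), 2 samples (columns)

# Center every dimension across the samples.
centered = data - data.mean(axis=1, keepdims=True)

# Thin SVD: the columns of u are principal directions in the 10-dimensional
# space, and there can be at most min(10, 2) = 2 of them.
u, s, vt = np.linalg.svd(centered, full_matrices=False)
directions = u.T  # one direction per row, laid out as 2 x 10

# After centering, two samples only differ along a single direction,
# so just one singular value is meaningfully non-zero.
n_nonzero = int(np.sum(s > 1e-10 * s[0]))
print(directions.shape, n_nonzero)
```

The 2 x 10 layout matches the shape reported in the question: two directions with ten elements each, of which only the first carries variance here.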

Responses of machine learning functions?

All of the machine learning functions explained here follow the format of cvStatModel.
For example the train function of NormalBayes is achieved by:
CvNormalBayesClassifier::train(const Mat& trainData, const Mat& responses, const Mat& varIdx=Mat(), const Mat& sampleIdx=Mat(), bool update=false )
The documentation tells you to check out cvStatModel for details on parameters.
What I don't understand is: what is responses supposed to take? I know that trainData is the data we use for training the system using bag of words, but what should go in responses?
In an example on bag of words the responses element was handled as follows:
float label=atof(entryPath.filename().c_str());
labels.push_back(label);
NormalBayesClassifier classifier;
classifier.train(trainingData, labels);
So here the filenames of the images were converted to doubles and used as the responses element.
I don't understand this and am confused by it. Can someone please explain what the responses element is supposed to take, and why atof is used in the above example?
Those models are supervised machine learning techniques, which means that training the model requires not only the training data (i.e. the vectors of measurements), but also the labels (or continuous values) associated with each sample. For example, if you are trying to detect images containing cats, you have a training set of, say, 500 images not containing cats and 500 containing cats. You compute your descriptors for all 1000 images, and you assign a number to each category (by convention, -1 for "non-cats", 1 for "cats"). Then, responses will be a 1000x1 matrix of integers, the first 500 values being -1 and the remaining 500 being 1.
In your example, atof is used to convert a directory name to a unique number representing the category, because the training examples are probably sorted by folder (folders cats, dogs, bicycles, etc.).
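When the folder names are words rather than numbers, the same idea can be sketched in Python with NumPy by mapping each name to an integer id. The folder list and the variable names here are hypothetical. The only requirement is that row i of the responses corresponds to row i of the training data.

```python
import numpy as np

# Hypothetical folder name of each training sample, in the same order
# as the rows of the training data matrix.
sample_folders = ["cats", "dogs", "cats", "bicycles", "dogs"]

# Give every category a unique integer id (alphabetical order here).
class_names = sorted(set(sample_folders))
class_to_id = {name: i for i, name in enumerate(class_names)}

# One label per sample, as an N x 1 column.
responses = np.array(
    [class_to_id[folder] for folder in sample_folders], dtype=np.int32
).reshape(-1, 1)
print(class_to_id, responses.ravel())
```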

Resources