Gaussian Weighting for point re-distribution - opencv

I am working with some points which are very compact together and therefore forming clusters amongst them is proving very difficult. Since I am new to this concept, I read in a paper about the concept of Gaussian weighting the points randomly or rather resampling using gaussian weight.
My question here is how are gaussian weight applied to the data points? Is it the actual normal distribution where I have to compute the means and the variance and SD and than randomly sample or there is other ways to do it. I am confused on this concept?
Can I get some hints on the concept please

I think you should look at book:
http://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738
There is a good chapters on modeling point distributions.

Related

How to measure distance between Fisher Vector for Image Retrieval?

I've read something about Fisher Vector and I'm still in the learning process. It's a better representation than the classic BoF representation, exploiting GMM (or k-means, even if that's usually referred as VLAD).
However, I've seen that usually they are used for classification problem, for example with SVM.
But what about Image Retrieval? I've seen that they have been used for image retrieval too (here), but I don't understand one point: given two FV representing 2 images, how do we compute their distances and so "how similar the two images are?"
Is it reasonable to use them in such a context?
As seen in the two papers below, Euclidean distance seems to be the popular choice. There are also references to using dot-product as a similarity measure; cosine similarity (closely related) is a generally popular metric for ML similarity.
http://link.springer.com/article/10.1007/s11263-013-0636-x
http://www.robots.ox.ac.uk/~vgg/publications/2013/Simonyan13/simonyan13.pdf
Is this enough to let you choose something that meets your needs?

Bag of Features / Visual Words + Locality Sensitive Hashing

PREMISE:
I'm really new to Computer Vision/Image Processing and Machine Learning (luckily, I'm more expert on Information retrieval), so please be kind with this filthy peasant! :D
MY APPLICATION:
We have a mobile application where the user takes a photo (the query) and the system returns the most similar picture thas was previously taken by some other user (the dataset element). Time performances are crucial, followed by precision and finally by memory usage.
MY APPROACH:
First of all, it's quite obvious that this is a 1-Nearest Neighbor problem (1-NN). LSH is a popular, fast and relatively precise solution for this problem. In particular, my LSH impelementation is about using Kernalized Locality Sensitive Hashing to achieve a good precision to translate a d-dimension vector to a s-dimension binary vector (where s<<d) and then use Fast Exact Search in Hamming Space
with Multi-Index Hashing to quickly find the exact nearest neighbor between all the vectors in the dataset (transposed to hamming space).
In addition, I'm going to use SIFT since I want to use a robust keypoint detector&descriptor for my application.
WHAT DOES IT MISS IN THIS PROCESS?
Well, it seems that I already decided everything, right? Actually NO: in my linked question I face the problem about how to represent the set descriptor vectors of a single image into a vector. Why do I need it? Because a query/dataset element in LSH is vector, not a matrix (while SIFT keypoint descriptor set is a matrix). As someone suggested in the comments, the commonest (and most efficient) solution is using the Bag of Features (BoF) model, which I'm still not confident with yet.
So, I read this article, but I have still some questions (see QUESTIONS below)!
QUESTIONS:
First and most important question: do you think that this is a reasonable approach?
Is k-means used in the BoF algorithm the best choice for such an application? What are alternative clustering algorithms?
The dimension of the codeword vector obtained by the BoF is equal to the number of clusters (so k parameter in the k-means approach)?
If 2. is correct, bigger is k then more precise is the BoF vector obtained?
There is any "dynamic" k-means? Since the query image must added to the dataset after the computation is done (remember: the dataset is formed by the images of all submitted queries) the cluster can change in time.
Given a query image, is the process to obtain the codebook vector the same as the one for a dataset image, e.g. we assign each descriptor to a cluster and the i-th dimension of the resulting vector is equal to the number of descriptors assigned to the i-th cluster?
It looks like you are building codebook from a set of keypoint features generated by SIFT.
You can try "mixture of gaussians" model. K-means assumes that each dimension of a keypoint is independent while "mixture of gaussians" can model the correlation between each dimension of the keypoint feature.
I can't answer this question. But I remember that the SIFT keypoint, by default, has 128 dimensions. You probably want a smaller number of clusters like 50 clusters.
N/A
You can try Infinite Gaussian Mixture Model or look at this paper: "Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis and Michael Jordan!
Not sure if I understand this question.
Hope this help!

Clustering Method Selection in High-Dimension?

If the data to cluster are literally points (either 2D (x, y) or 3D (x, y,z)), it would be quite intuitive to choose a clustering method. Because we can draw them and visualize them, we somewhat know better which clustering method is more suitable.
e.g.1 If my 2D data set is of the formation shown in the right top corner, I would know that K-means may not be a wise choice here, whereas DBSCAN seems like a better idea.
However, just as the scikit-learn website states:
While these examples give some intuition about the algorithms, this
intuition might not apply to very high dimensional data.
AFAIK, in most of the piratical problems we don't have such simple data. Most probably, we have high-dimensional tuples, which cannot be visualized like such, as data.
e.g.2 I wish to cluster a data set where each data is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT visualize it in a coordinate system and observes its distribution like before. So I will NOT be able to say DBSCAN is superior to K-means in this case.
So my question:
How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?
"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).
4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.
Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.
The authors of that survey also publish a software framework which has a lot of these advanced clustering methods (not just k-means, but e.h. CASH, FourC, ERiC): ELKI
There are at least two common, generic approaches:
One can use some dimensionality reduction technique in order to actually visualize the high dimensional data, there are dozens of popular solutions including (but not limited to):
PCA - principal component analysis
SOM - self-organizing maps
Sammon's mapping
Autoencoder Neural Networks
KPCA - kernel principal component analysis
Isomap
After this one goes back to the original space and use some techniques that seems resonable based on observations in the reduced space, or performs clustering in the reduced space itself.First approach uses all avaliable information, but can be invalid due to differences induced by the reduction process. While the second one ensures that your observations and choice is valid (as you reduce your problem to the nice, 2d/3d one) but it loses lots of information due to transformation used.
One tries many different algorithms and choose the one with the best metrics (there have been many clustering evaluation metrics proposed). This is computationally expensive approach, but has a lower bias (as reducting the dimensionality introduces the information change following from the used transformation)
It is true that high dimensional data cannot be easily visualized in an euclidean high dimensional data but it is not true that there are no visualization techniques for them.
In addition to this claim I will add that with just 4 features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try a multivariate data analysis taking two features at a time (so 6 times in total) to try to figure out which relations intercour between the two (correlation and dependency generally). Or you can even use a 3d space for three at a time.
Then, how to get some info from these visualizations? Well, it is not as easy as in an euclidean space but the point is to spot visually if the data clusters in some groups (eg near some values on an axis for a parallel coordinate diagram) and think if the data is somehow separable (eg if it forms regions like circles or line separable in the scatter plots).
A little digression: the diagram you posted is not indicative of the power or capabilities of each algorithm given some particular data distributions, it simply highlights the nature of some algorithms: for instance k-means is able to separate only convex and ellipsoidail areas (and keep in mind that convexity and ellipsoids exist even in N-th dimensions). What I mean is that there is not a rule that says: given the distributiuons depicted in this diagram, you have to choose the correct clustering algorithm consequently.
I suggest to use a data mining toolbox that lets you explore and visualize the data (and easily transform them since you can change their topology with transformations, projections and reductions, check the other answer by lejlot for that) like Weka (plus you do not have to implement all the algorithms by yourself.
In the end I will point you to this resource for different cluster goodness and fitness measures so you can compare the results rfom different algorithms.
I would also suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU calculation with euclidean distance, for example.

In scikit-learn, can DBSCAN use sparse matrix?

I got Memory Error when I was running dbscan algorithm of scikit.
My data is about 20000*10000, it's a binary matrix.
(Maybe it's not suitable to use DBSCAN with such a matrix. I'm a beginner of machine learning. I just want to find a cluster method which don't need an initial cluster number)
Anyway I found sparse matrix and feature extraction of scikit.
http://scikit-learn.org/dev/modules/feature_extraction.html
http://docs.scipy.org/doc/scipy/reference/sparse.html
But I still have no idea how to use it. In DBSCAN's specification, there is no indication about using sparse matrix. Is it not allowed?
If anyone knows how to use sparse matrix in DBSCAN, please tell me.
Or you can tell me a more suitable cluster method.
The scikit implementation of DBSCAN is, unfortunately, very naive. It needs to be rewritten to take indexing (ball trees etc.) into account.
As of now, it will apparently insist of computing a complete distance matrix, which wastes a lot of memory.
May I suggest that you just reimplement DBSCAN yourself. It's fairly easy, there exists good pseudocode e.g. on Wikipedia and in the original publication. It should be just a few lines, and you can then easily take benefit of your data representation. E.g. if you already have a similarity graph in a sparse representation, it's usually fairly trivial to do a "range query" (i.e. use only the edges that satisfy your distance threshold)
Here is a issue in scikit-learn github where they talk about improving the implementation. A user reports his version using the ball-tree is 50x faster (which doesn't surprise me, I've seen similar speedups with indexes before - it will likely become more pronounced when further increasing the data set size).
Update: the DBSCAN version in scikit-learn has received substantial improvements since this answer was written.
You can pass a distance matrix to DBSCAN, so assuming X is your sample matrix, the following should work:
from sklearn.metrics.pairwise import euclidean_distances
D = euclidean_distances(X, X)
db = DBSCAN(metric="precomputed").fit(D)
However, the matrix D will be even larger than X: n_samples² entries. With sparse matrices, k-means is probably the best option.
(DBSCAN may seem attractive because it doesn't need a pre-determined number of clusters, but it trades that for two parameters that you have to tune. It's mostly applicable in settings where the samples are points in space and you know how close you want those points to be to be in the same cluster, or when you have a black box distance metric that scikit-learn doesn't support.)
Yes, since version 0.16.1.
Here's a commit for a test:
https://github.com/scikit-learn/scikit-learn/commit/494b8e574337e510bcb6fd0c941e390371ef1879
Sklearn's DBSCAN algorithm doesn't take sparse arrays. However, KMeans and Spectral clustering do, you can try these. More on sklearns clustering methods: http://scikit-learn.org/stable/modules/clustering.html

ML / density clustering on house areas. two-component or more mixtures in each dimension

I trying to self-learn ML and came across this problem. Help from more experienced people in the field would be much appreciated!
Suppose i have three vectors with areas for house compartments such as bathroom, living room and kitchen. The data consists of about 70,000 houses. A histogram of each individual vector clearly has evidence for a bimodal distribution, say a two-component gaussian mixture. I now wanted some sort of ML algorithm, preferably unsupervised, that would classify houses according to these attributes. Say: large bathroom, small kitchen, large living-room.
More specifically, i would like an algorithm to choose the best possible separation threshold for each bimodal distribution vector, say large/small kitchen (this can be binary as there we assume evidence for a bimodality), do the same for others and cluster the data. Ideally this would come with some confidence measure so that i could check houses in the intermediate regimes... for instance, a house with clearly a large kitchen, but whose bathroom would fall close to a threshold area/ boundary for large/small bathroom would be put for example on the bottom of a list with "large kitchens and large bathrooms". Because of this reason, first deciding on a threshold (fitting the gausssians with less possible FDR), collapsing the data and then clustering would Not be desirable.
Any advice on how to proceed? I know R and python.
Many thanks!!
What you're looking for is a clustering method: this is basically unsupervised classification. A simple method is k-means, which has many implementations (k-means can be viewed as the limit of a multi-variate Gaussian mixture as the variance tends to zero). This would naturally give you a confidence measure, which would be related to the distance metric (Euclidean distance) between the point in question and the centroids.
One final note: I don't know about clustering each attribute in turn, and then making composites from the independent attributes: why not let the algorithm find the clusters in multi-dimensional space? Depending on the choice of algorithm, this will take into account covariance in the features (big kitchen increases the probability of big bedroom) and produce natural groupings you might not consider in isolation.
Sounds like you want EM clustering with a mixture of Gaussians model.
Should be in the mclust package in R.
In addition to what the others have suggested, it is indeed possible to cluster (maybe even density-based clustering methods such as DBSCAN) on the individual dimensions, forming one-dimensional clusters (intervals) and working from there, possibly combining them into multi-dimensional, rectangular-shaped clusters.
I am doing a project involving exactly this. It turns out there are a few advantages to running density-based methods in one dimension, including the fact that you can do what you are saying about classifying objects on the border of one attribute according to their other attributes.

Resources