Cluster data with the output centers of the kmeans function - OpenCV

Hi, I have clustered some data with the kmeans function and stored the cluster centers it produces as output. Now I have a new set of vectors in a Mat object and want to know which cluster each vector belongs to. Is there a simple way to do that, or should I just calculate the Euclidean distance of each vector to all the centers and choose the cluster it is closest to?
If I have to go the second way, are there any efficiency considerations to make it fast?

It seems that you're interested in performing some type of cluster assignment using the results of running K-Means on an initial data set, right?
You could just assign the new observation to the closest mean. Unfortunately, with K-Means you don't know anything about the shape or size of each cluster. For example, consider a scenario where a new vector is equidistant (or roughly equidistant) from two means. What do you do in this scenario? Do you make a hard assignment to one of the clusters?
In this situation it's probably better to actually look at the original data that comprises each of the clusters and do some type of K-Nearest Neighbor assignment (http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm). For example, it may turn out that while the new vector is roughly equidistant from two different cluster centers, it is much closer to the data from one of the clusters (indicating that it likely belongs to that cluster).
As an alternative to K-Means, if you used something like a Mixture of Gaussians fit with EM, you'd not only have a set of cluster centers (as you do with K-Means), but also a covariance describing the size and shape of each cluster. For each new observation, you could then compute the probability that it belongs to each cluster without revisiting the data from each cluster (as it's baked into the MoG model).
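If the simple closest-center assignment is what you need, here is a minimal sketch, assuming the centers come from cv2.kmeans() and both the centers and the new vectors are NumPy arrays (the function name assign_to_nearest_center is illustrative):

```python
# A minimal sketch of the "assign to the closest center" approach, assuming the
# centers returned by cv2.kmeans() and the new vectors are NumPy float arrays.
import numpy as np
from scipy.spatial.distance import cdist

def assign_to_nearest_center(new_vectors, centers):
    """Return, for each row of new_vectors, the index of the closest center."""
    # cdist computes the full (n_new x k) Euclidean distance matrix in one call,
    # which is much faster than a Python loop over vectors and centers.
    distances = cdist(new_vectors, centers, metric="euclidean")
    return np.argmin(distances, axis=1)

# Example usage (made-up data):
# _, labels, centers = cv2.kmeans(data, K, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
# new_labels = assign_to_nearest_center(new_data, centers)
```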

Related

Why does k-means in scikit-learn have a predict() function but DBSCAN/agglomerative doesn't?

Scikit-learn's implementation of k-means has a predict() function which can be applied to unseen data, whereas DBSCAN and Agglomerative do not have a predict() function.
All three algorithms have fit_predict(), which is used to fit the model and then predict. But k-means has predict(), which can be used directly on unseen data, and that is not the case for the other algorithms.
I am very much aware that these are clustering algorithms and, in my opinion, predict() should not be there for k-means either.
What is the possible intuition/reason behind this discrepancy? Is it only because k-means performs a "1NN classification", and so it has a predict() function?
My interpretation is that the difference comes from the way the clusters are computed: in KMeans there is a natural way to assign a new point to a cluster, while there is not in DBSCAN or agglomerative clustering.
A) KMeans
In KMeans, during the construction of the clusters, a data point is assigned to the cluster with the closest centroid, and the centroids are updated afterwards. "Predicting" in the KMeans algorithm is actually doing the assignment step without updating the clusters.
If you assume that the new data points are drawn from the same distribution as the "training" set, and that your "training" set was representative enough, it is reasonable to think that one can assign the new data points following the heuristic of the algorithm without updating the cluster centroids, thus making predictions.
Of course, if the distribution of the data points is likely to change, one should rerun the KMeans clustering on the updated dataset.
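A minimal sketch of this fit/predict distinction in scikit-learn, on toy 2-D data:

```python
# A minimal sketch of KMeans fit() vs. predict() in scikit-learn, using toy data.
import numpy as np
from sklearn.cluster import KMeans

X_train = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# predict() only runs the assignment step: each new point gets the label of the
# closest centroid learned during fit(); the centroids themselves are not updated.
X_new = np.array([[0.5, 0.5], [9.5, 10.5]])
print(km.predict(X_new))        # e.g. [0 1] (label numbering may differ)
print(km.cluster_centers_)      # unchanged by predict()
```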
B) DBSCAN
DBSCAN creates the clusters by finding high-density areas of the dataset (controlled by the parameters epsilon and min_points). This is done by computing point-level properties (whether a point is a core point, a directly reachable point, a reachable point or a noise point). Adding a new data point can modify the definition of the neighboring points, and thus make the computed clusters obsolete.
As an example, take the standard DBSCAN illustration from Wikipedia: there is one cluster (red + yellow points) and one noise point N (blue), where red points are core points and yellow points are reachable points. Now consider two cases:
Adding a new point halfway between A and N would make N a reachable point from A and thus belonging to the cluster.
Adding (min_points-1) new points in the epsilon-neighborhood of N, but in no other epsilon-neighborhood (as an example at the top of the picture), would change the status of N which would become a core point, and form a new cluster with the newly added points.
Here, adding new data points clearly requires recomputing the clusters.
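As a small illustration of these point-level properties (and of the fact that there is no predict()), here is a sketch on toy 1-D data; the parameter values are arbitrary:

```python
# A small sketch, on toy 1-D data, of the point-level properties DBSCAN exposes
# in scikit-learn; note there is no predict() method, only fit_predict()/fit().
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0], [0.5], [1.0], [10.0]])       # three dense points and one outlier
db = DBSCAN(eps=1.0, min_samples=2).fit(X)

print(db.labels_)                # [ 0  0  0 -1]  (-1 marks the noise point)
print(db.core_sample_indices_)   # indices of the core points of the cluster
# Adding new points near the outlier could turn it into a core point and create
# a new cluster, which is why the whole clustering has to be recomputed.
```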
C) Agglomerative clustering
Agglomerative clustering iteratively builds the clusters starting from individual points and merges them according to a linkage measure. Similarly to DBSCAN, adding new data points can entirely modify the final clusters because it can trigger different mergings.
As an example, if the linkage strategy you choose in sklearn is "single", clusters are merged if the minimum distance between all elements of the two clusters is below a chosen threshold. You can easily see that a single well-placed new data point can trigger a merge between two clusters that would otherwise have stayed separate (see the sketch below).
Thus predicting here also requires recomputing the clusters.
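Here is a hedged sketch of that single-linkage effect on toy 1-D data; the threshold and point values are arbitrary:

```python
# A sketch of the single-linkage merge effect described above, using toy 1-D data
# and sklearn's AgglomerativeClustering with a distance threshold.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster(points):
    X = np.array(points, dtype=float).reshape(-1, 1)
    model = AgglomerativeClustering(n_clusters=None, distance_threshold=2.5,
                                    linkage="single")
    return model.fit_predict(X)

print(cluster([0, 1, 2, 6, 7, 8]))      # two clusters: the 2 -> 6 gap (4) is above the threshold
print(cluster([0, 1, 2, 4, 6, 7, 8]))   # one cluster: the point at 4 bridges the gap (2 < 2.5)
```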

How to check if a data point is within the boundary of a cluster or not

Suppose I have done clustering (using 3 features) and got 4 clusters, training on a set of data points.
Now in production I will be getting a different set of data points, and based on the feature values of each data point I need to know whether it falls in one of the pre-defined clusters I made earlier or not. This is not clustering, but rather finding whether a point falls within a pre-defined cluster.
How do I find whether the point is in a cluster?
Do I need to run linear regression to find the equation of the boundary covering the cluster?
There is no general answer to your question. The way a new point is assigned to a cluster is a property of the clustering method itself, so the crucial thing is "what clustering procedure was used in the first place". Each well-defined clustering method (in the mathematical sense) provides you with a partitioning of the whole input space, not just of the finite training set. Such techniques include k-means, GMM, ...
However, there are exceptions - clustering methods which are simply heuristics, and not valid optimization problems. For example, if you use hierarchical clustering there is no partitioning of the space, thus you cannot correctly assign a new point to any cluster, and you are left with dozens of equally correct, heuristic methods which will do something - but you cannot say which one is correct. These heuristics include:
the "closest point" heuristic, which is essentially equivalent to training a 1-NN classifier on your clusters (see the sketch after this list)
the "build a valid model" heuristic, which is a generalization of the above where you fit some classifier (of your choice) to mimic the original clustering (and select its hyperparameters through cross-validation).
"what would happen if I re-ran the clustering": if you can re-run the clustering from the previous solution, you can simply check which cluster the point falls into, given the previous clustering as a starting point.
...
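As an illustration of the "closest point" heuristic, here is a sketch that fits a 1-NN classifier on the labels produced by a clustering method that has no native predict(); the data and variable names are made up:

```python
# Fit a 1-NN classifier on the original data and its cluster labels, then use it
# to assign new points. This mimics a predict() for methods that lack one.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.RandomState(0).rand(100, 3)            # 100 points, 3 features
labels = AgglomerativeClustering(n_clusters=4).fit_predict(X_train)

assigner = KNeighborsClassifier(n_neighbors=1).fit(X_train, labels)
new_points = np.random.RandomState(1).rand(5, 3)
print(assigner.predict(new_points))                         # cluster label per new point
```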

DBSCAN using spatial and temporal data

I am looking at data points that have lat, lng, and date/time of event. One of the algorithms I came across when looking at clustering algorithms was DBSCAN. While it works OK at clustering lat and lng, my concern is that it will fall apart when incorporating temporal information, since it's not on the same scale or the same type of distance.
What are my options for incorporating temporal data into the DBSCAN algorithm?
Look up Generalized DBSCAN by the same authors.
Sander, Jörg; Ester, Martin; Kriegel, Hans-Peter; Xu, Xiaowei (1998). Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Mining and Knowledge Discovery (Berlin: Springer-Verlag) 2(2): 169–194. doi:10.1023/A:1009745219419.
For (Generalized) DBSCAN, you need two functions:
findNeighbors - get all "related" objects from your database
corePoint - decide whether this set is enough to start a cluster
Then you can repeatedly find neighbors to grow the clusters.
Function 1 is where you want to hook in, for example by using two thresholds: one that is geographic and one that is temporal (e.g. within 100 miles, and within 1 hour).
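One hedged way to sketch this dual-threshold neighborhood with off-the-shelf tools is to fold both thresholds into a single precomputed distance matrix and run scikit-learn's DBSCAN on it (this assumes the coordinates are already projected to kilometres; eps_km and eps_hours are illustrative names):

```python
# Two points are neighbors only if they are within BOTH thresholds, which we
# encode by taking the maximum of the two distances after normalizing each one
# by its own threshold, then running DBSCAN with eps=1 on the combined matrix.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

def spatio_temporal_dbscan(coords_km, times_hours, eps_km=160.0, eps_hours=1.0,
                           min_samples=5):
    d_space = pairwise_distances(coords_km)                      # geographic distance
    d_time = pairwise_distances(times_hours.reshape(-1, 1))      # time difference
    # Normalized "combined" distance: <= 1 iff within both thresholds.
    d_combined = np.maximum(d_space / eps_km, d_time / eps_hours)
    return DBSCAN(eps=1.0, min_samples=min_samples,
                  metric="precomputed").fit_predict(d_combined)
```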
tl;dr: you are going to have to modify your feature set, i.e. scale your date/time to match the magnitude of your geo data.
DBSCAN's input is simply a vector, and the algorithm itself doesn't know that one dimension (time) is orders of magnitude bigger or smaller than another (distance). Thus, when calculating the density of data points, the difference in scaling will screw it up.
Now, I suppose you could modify the algorithm itself to treat different dimensions differently. This can be done by changing the definition of "distance" between two points, i.e. supplying your own distance function instead of using the default Euclidean distance.
IMHO, though, the easier thing to do is to scale one of your dimensions to match the other: just multiply your time values by a fixed, linear factor so they are on the same order of magnitude as the geo values, and you should be good to go.
More generally, this is part of the feature selection process, which is arguably the most important part of solving any machine learning problem. Choose the right features, and transform them correctly, and you'll be more than halfway to a solution.
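A minimal sketch of that scaling idea, where the conversion factor is an illustrative choice rather than a recommendation:

```python
# Convert time to a pseudo-spatial coordinate so that "hours" and "kilometres"
# carry comparable weight, then run plain DBSCAN on the stacked features.
import numpy as np
from sklearn.cluster import DBSCAN

def scaled_dbscan(coords_km, times_hours, km_per_hour=100.0, eps=100.0, min_samples=5):
    # One extra "distance" column: 1 hour of separation counts like km_per_hour km.
    features = np.column_stack([coords_km, times_hours * km_per_hour])
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
```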

Clustering Method Selection in High-Dimension?

If the data to cluster are literally points (either 2D (x, y) or 3D (x, y, z)), it is quite intuitive to choose a clustering method: because we can draw and visualize them, we have a better sense of which clustering method is more suitable.
e.g. 1: If my 2D data set has the shape shown in the top-right corner, I would know that k-means may not be a wise choice here, whereas DBSCAN seems like a better idea.
However, just as the scikit-learn website states:
While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional data.
AFAIK, in most practical problems we don't have such simple data. Most probably, we have high-dimensional tuples, which cannot be visualized in that way.
e.g. 2: I wish to cluster a data set where each data point is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT visualize it in a coordinate system and observe its distribution like before. So I will NOT be able to say that DBSCAN is superior to k-means in this case.
So my question:
How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?
"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).
4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
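For instance, a scatter plot matrix of the iris data can be produced with pandas' built-in helper (a hedged sketch; seaborn's pairplot would work just as well):

```python
# Scatter plot matrix of the 4-dimensional iris measurements.
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data                               # 150 x 4 DataFrame of the four measurements
scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()
```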
However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.
Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.
The authors of that survey also publish a software framework which includes many of these advanced clustering methods (not just k-means, but e.g. CASH, FourC, ERiC): ELKI
There are at least two common, generic approaches:
One can use some dimensionality reduction technique in order to actually visualize the high-dimensional data; there are dozens of popular solutions, including (but not limited to):
PCA - principal component analysis
SOM - self-organizing maps
Sammon's mapping
Autoencoder Neural Networks
KPCA - kernel principal component analysis
Isomap
After this, one either goes back to the original space and uses techniques that seem reasonable based on observations in the reduced space, or performs the clustering in the reduced space itself. The first approach uses all available information, but can be invalid due to distortions introduced by the reduction process, while the second one ensures that your observations and choice are valid (as you reduce your problem to a nice 2D/3D one) but loses a lot of information due to the transformation used.
One tries many different algorithms and chooses the one with the best metrics (many clustering evaluation metrics have been proposed). This is a computationally expensive approach, but it has a lower bias (as reducing the dimensionality itself introduces information changes that follow from the transformation used). A sketch of both approaches follows this list.
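A hedged sketch of both approaches on 4-D data, with illustrative choices of algorithms, parameters and metric:

```python
# Approach 1: reduce with PCA to 2-D for inspection/clustering.
# Approach 2: compare several algorithms with a generic metric (silhouette score).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

X = load_iris().data                                  # 150 samples, 4 features
X_2d = PCA(n_components=2).fit_transform(X)           # reduced data for plots or clustering

candidates = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=0.8, min_samples=5),
}
for name, algo in candidates.items():
    labels = algo.fit_predict(X)
    if len(set(labels)) > 1:                          # silhouette needs at least 2 labels
        print(name, silhouette_score(X, labels))
```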
It is true that high-dimensional data cannot be easily visualized in a Euclidean high-dimensional space, but it is not true that there are no visualization techniques for them.
In addition to this claim, I will add that with just 4 features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try a multivariate data analysis taking two features at a time (so 6 pairs in total) to figure out which relations hold between each pair (correlation and dependency, generally). Or you can even use a 3D space for three features at a time.
Then, how do you get some information from these visualizations? Well, it is not as easy as in a Euclidean space, but the point is to spot visually whether the data clusters into groups (e.g. near some values on an axis in a parallel coordinates diagram) and to think about whether the data is somehow separable (e.g. whether it forms regions like circles or linearly separable regions in the scatter plots).
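As a hedged sketch, a parallel coordinates plot of 4-feature data can be drawn with pandas' helper, here on the iris data as a stand-in (the class column is only used to color the lines):

```python
# Parallel coordinates plot of the 4-dimensional iris data.
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()
# Replace the integer target with species names; used here only for line colors.
df["target"] = df["target"].map(dict(enumerate(iris.target_names)))
parallel_coordinates(df, class_column="target", alpha=0.4)
plt.show()
```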
A little digression: the diagram you posted is not indicative of the power or capabilities of each algorithm given particular data distributions; it simply highlights the nature of some algorithms. For instance, k-means is able to separate only convex and ellipsoidal areas (and keep in mind that convexity and ellipsoids exist in N dimensions as well). What I mean is that there is no rule that says: given the distributions depicted in this diagram, you have to choose the corresponding clustering algorithm.
I suggest using a data mining toolbox that lets you explore and visualize the data (and easily transform it, since you can change its topology with transformations, projections and reductions; check the other answer by lejlot for that), like Weka (plus you do not have to implement all the algorithms yourself).
In the end I will point you to this resource for different cluster goodness and fitness measures so you can compare the results from different algorithms.
I would also suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU calculation with Euclidean distance, for example.

Kmeans2 returning the ordered centroids list based on how many data points they correspond to

I am using the OpenCV Kmeans2() clustering function on some data set. In practice, the cluster centroids start out randomly initialized, and then the clustering proceeds.
How do I get those centroids back in a certain order, say the centroid of the largest chunk of the data set first, then the next (in decreasing order, assuming there are no ties between clusters of different point locations in multidimensional space), and so on?
Why don't you order them as you like yourself?
For most CV applications this is not necessary, and you can do it just as easily yourself, I guess. It's not as if there were an optimized k-means that returned clusters in a particular order. As k-means is initialized randomly, there is no "natural" order of the clusters. And as you already noted, there can be ties, both in cluster assignment and in cluster sizes.
I am planning to count up the labels and see which centroid the largest part of the data set belongs to.
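A hedged sketch of that counting approach with NumPy, where the labels and centers stand in for the output of the OpenCV k-means call:

```python
# Count the labels returned by k-means and reorder the centers by decreasing
# cluster size. The labels/centers below are stand-ins for the real output.
import numpy as np
# compactness, labels, centers = cv2.kmeans(data, K, None, criteria, 10,
#                                           cv2.KMEANS_RANDOM_CENTERS)
labels = np.array([0, 2, 2, 1, 2, 0])
centers = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 1.0]])

counts = np.bincount(labels.ravel(), minlength=len(centers))
order = np.argsort(-counts)                     # cluster indices, largest first
ordered_centers = centers[order]
print(counts, order, ordered_centers)
```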
