Customized k-means clustering algorithm - machine-learning

In my problem, I want to create clusters of closely located nodes (servers). I need to create the clusters so that the nodes in each cluster collectively fulfill a required amount of processing resource. As I understand it, k-means clustering could be one solution. My only concern is that the k-means algorithm groups closely located objects into a cluster without putting any lower or upper limit on the number of nodes the cluster contains, whereas in my problem I have to impose a limit in terms of the required compute resource: each cluster must contain at minimum enough nodes that their collective processing resource is greater than or equal to the required compute resource. Can I customize k-means so that, when forming each cluster, it keeps adding nodes to the cluster until the required processing resource criterion is met? Please also advise whether there is another clustering technique that suits my problem.
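For illustration only, here is a minimal sketch of the greedy idea described in the question; this is an assumption about one possible approach, not an established algorithm, and the function name and the required_capacity parameter are hypothetical:

```python
import numpy as np

def greedy_capacity_clusters(coords, capacities, required_capacity):
    """Group nearby nodes until each group's total capacity meets the requirement.

    coords: (n, d) array of node coordinates
    capacities: (n,) array of per-node processing resources
    required_capacity: minimum total resource a cluster must provide
    """
    unassigned = set(range(len(coords)))
    clusters = []
    while unassigned:
        seed = unassigned.pop()            # start a new cluster at any unassigned node
        members, total = [seed], capacities[seed]
        while total < required_capacity and unassigned:
            # absorb the unassigned node closest to the cluster's current centroid
            centroid = coords[members].mean(axis=0)
            rest = list(unassigned)
            nearest = rest[np.argmin(np.linalg.norm(coords[rest] - centroid, axis=1))]
            unassigned.remove(nearest)
            members.append(nearest)
            total += capacities[nearest]
        # note: the final cluster may fall short if no unassigned nodes remain
        clusters.append(members)
    return clusters
```

In practice you might merge an under-capacity leftover cluster into its nearest neighbor, or use a k-means result as the seeding order instead of an arbitrary node.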

Related

Hierarchical Clustering

I have read some resources and found out how hierarchical clustering works. However, when I compare it with k-means clustering, it seems to me that k-means really produces a specific number of clusters, whereas hierarchical analysis only shows me how the samples can be clustered. What I mean is that I do not get a specific number of clusters in hierarchical clustering; I only get a scheme for how the clusters can be constituted and a picture of the relationships between the samples.
Thus, I cannot understand where I can use this clustering method.
Hierarchical clustering (HC) is just another distance-based clustering method like k-means. The number of clusters can be roughly determined by cutting the dendrogram produced by HC. Determining the number of clusters in a data set is not an easy task for any clustering method, and it usually depends on your application. Tuning the thresholds in HC may be more explicit and straightforward for researchers, especially for a very large data set. I think this question is also related.
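As a concrete illustration of cutting the dendrogram, here is a minimal sketch using SciPy; the data and the threshold value are just assumptions for the example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))               # toy data

Z = linkage(X, method="ward")              # build the hierarchy (the dendrogram)
labels = fcluster(Z, t=5.0, criterion="distance")  # cut at merge distance 5.0
print(len(set(labels)), "clusters at this threshold")
```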
In k-means clustering, k is a hyperparameter that you need to choose in order to divide your data points into clusters, whereas in hierarchical clustering (let's take one type of hierarchical clustering, i.e. agglomerative), you first treat every point in your dataset as its own cluster and then repeatedly merge the two most similar clusters until you are left with a single cluster. I will explain this with an example.
Suppose initially you have 13 points (x_1, x_2, ..., x_13) in your dataset, so at the start you have 13 clusters. Now let's say that in the second step you get 7 clusters (x_1-x_2, x_4-x_5, x_6-x_8, x_3-x_7, x_9-x_10, x_11-x_12, x_13) based on the similarity between the points. In the third step, let's say you get 4 clusters (x_1-x_2-x_4-x_5, x_6-x_8-x_9-x_10, x_3-x_7-x_13, x_11-x_12). Continuing like this, you eventually arrive at a step in which all the points in your dataset form one cluster, which is also the last step of the agglomerative clustering algorithm.
So in hierarchical clustering there is no k to fix in advance: depending upon your problem, if you want 7 clusters you stop at the second step, if you want 4 clusters you stop at the third step, and so on.
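To make the "stop at the step you need" idea concrete, here is a minimal sketch with scikit-learn's AgglomerativeClustering; the toy data and the cluster counts of 7 and 4 are just the values from the example above:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(13, 2))               # 13 toy points, like x_1 ... x_13

# Asking for 7 or 4 clusters is equivalent to stopping the merging
# process at the corresponding level of the hierarchy.
labels_7 = AgglomerativeClustering(n_clusters=7).fit_predict(X)
labels_4 = AgglomerativeClustering(n_clusters=4).fit_predict(X)
print(labels_7)
print(labels_4)
```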
A practical advantage of hierarchical clustering is the possibility of visualizing the results using a dendrogram. If you don't know in advance what number of clusters you're looking for (as is often the case…), the dendrogram plot can help you choose k with no need to create separate clusterings. The dendrogram can also give great insight into the data structure, help identify outliers, etc. Hierarchical clustering is also deterministic, whereas k-means with random initialization can give you different results when run several times on the same data.
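For example, a minimal sketch of plotting the dendrogram with SciPy (toy data only):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))

Z = linkage(X, method="ward")
dendrogram(Z)                      # inspect the merge heights to pick a sensible k
plt.xlabel("sample index")
plt.ylabel("merge distance")
plt.show()
```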
Hope this helps.

How to check if a data point is within the boundary of a cluster or not

Suppose I have done clustering (using 3 features) on a set of training data points and got 4 clusters.
Now in production I will be getting a different set of data points, and based on the values of the features of each data point, I need to know whether it falls into one of the pre-defined clusters that I made earlier. This is not doing clustering but rather finding whether a point falls within a pre-defined cluster.
How do I find whether the point is in a cluster?
Do I need to run linear regression to find the equation of the boundary covering the cluster?
There is no general answer to your question. The way a new point is assigned to a cluster is a property of the clustering itself. Thus the crucial thing is "what clustering procedure was used in the first place". Each well-defined clustering method (in the mathematical sense) provides you with a partitioning of the whole input space, not just of the finite training set. Such techniques include k-means, GMM, ...
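For instance, with scikit-learn a fitted k-means or Gaussian mixture model can assign any new point to a cluster directly, because the fitted model defines a partition of the whole feature space; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))        # 3 features, as in the question
X_new = rng.normal(size=(5, 3))            # points arriving in production

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)
print(kmeans.predict(X_new))               # cluster index for each new point

gmm = GaussianMixture(n_components=4, random_state=0).fit(X_train)
print(gmm.predict(X_new))                  # same idea, with a probabilistic model
```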
However, there are exceptions: clustering methods which are simply heuristics, not well-defined optimization problems. For example, if you use hierarchical clustering there is no partitioning of the space, thus you cannot correctly assign a new point to any cluster, and you are left with dozens of equally valid heuristic methods which will do something, but you cannot say which one is correct. These heuristics include:
"closest point heuristics", which is essentialy equivalent of training 1-NN on your clusters
"build a valid model heuristics" which is a generalization of the above where you fit some classifier (of your choice) to mimic the original clustering (and select its hyperparameters through cross validation).
"what would happen if I re-run the clustering", if you can re-run the clustering from the previous solution you can simply check what cluster it falls into given previous clustering as a starting point.
...
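As a sketch of the 1-NN heuristic mentioned above, assuming the original clustering already produced a label for every training point (the labels below are random stand-ins):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
labels = rng.integers(0, 4, size=200)      # stand-in for the labels your clustering produced

# Treat the cluster labels as classes and assign each new point to the
# cluster of its single nearest training point.
nn = KNeighborsClassifier(n_neighbors=1).fit(X_train, labels)
X_new = rng.normal(size=(5, 3))
print(nn.predict(X_new))
```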

How to correct for multiple testing with enrichment test?

I run an enrichment test to find genes mutated at a different rate within one cluster of samples compared to outside it, based on a two-tailed Fisher's exact test.
So finally I have a 5x10 matrix of p-values.
I wonder how to correct them for multiple testing. Should I correct by gene or by cluster?
If you have SNP data such that all your clusters have more or less the same p-values, then it is my understanding that you should pick a random p-value from each cluster, use these when correcting for multiple testing, and subsequently apply that p-value's correction to the entire cluster. If there are large differences within the cluster, I think you should correct by gene.
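As a purely mechanical illustration of correcting a matrix of p-values by gene, here is a minimal sketch using Benjamini-Hochberg from statsmodels; whether you should correct per gene, per cluster, or over the pooled matrix is the statistical question discussed above, and which axis holds the genes is an assumption here:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
pvals = rng.uniform(size=(5, 10))          # stand-in for the 5x10 matrix of Fisher p-values

# Benjamini-Hochberg applied separately to each gene, here taken to be each row;
# swap axes if your matrix is laid out the other way.
corrected = np.vstack([
    multipletests(row, method="fdr_bh")[1] for row in pvals
])
print(corrected.shape)
```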

Cluster data with output centers of Kmeans function

Hi, I have clustered some data with the kmeans function and stored the cluster centers it produces as output. Now I have a new set of vectors in a Mat object and want to know which cluster each vector belongs to. Is there a simple way to do that, or should I just calculate the Euclidean distance of each vector to all the centers and choose the cluster it is closest to?
If I should go for the second way, are there any efficiency considerations to make it fast?
It seems that you're interested in performing some type of cluster assignment using the results of running K-Means on an initial data set, right?
You could just assign the new observation to the closest mean. Unfortunately, with K-Means you don't know anything about the shape or size of each cluster. For example, consider a scenario where a new vector is equidistant (or roughly equidistant) from two means. What do you do in this scenario? Do you make a hard assignment to one of the clusters?
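For the simple nearest-center assignment (and to address the efficiency question), the distances to all centers can be computed in one vectorized call rather than in a per-vector loop; a minimal NumPy/SciPy sketch, assuming the centers and new vectors have been converted to plain float arrays:

```python
import numpy as np
from scipy.spatial.distance import cdist

centers = np.random.rand(4, 3)             # stand-in for the centers kmeans returned
new_vectors = np.random.rand(100, 3)       # stand-in for the new data

dists = cdist(new_vectors, centers)        # (100, 4) matrix of Euclidean distances
assignments = dists.argmin(axis=1)         # index of the closest center per vector
```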
In this situation it's probably better to actually look at the original data that comprises each of the clusters, and do some type of K-Nearest Neighbor assignment (http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm). For example, it may turn out that while the new vector is roughly equidistant from two different cluster centers, it is much closer to the data from one of the clusters (indicating that it likely belongs to that cluster).
As an alternative to K-Means, if you used something like a Mixture of Gaussians with EM, you'd not only have a set of cluster centers (as you do with K-Means), but also a variance describing the size of each cluster. For each new observation, you could then compute the probability that it belongs to each cluster without revisiting the data from each cluster (as it's baked into the MoG EM model).
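A minimal sketch of that alternative with scikit-learn's GaussianMixture; the toy data and the component count of 4 are just assumptions for the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))
X_new = rng.normal(size=(5, 3))

gmm = GaussianMixture(n_components=4, random_state=0).fit(X_train)
print(gmm.predict_proba(X_new))            # per-cluster membership probabilities
print(gmm.predict(X_new))                  # hard assignment if you need one
```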

Determining groups in a hierarchical cluster

I have an algorithm that can group data into a hierarchical cluster tree. The algorithm is the one described in Toby Segaran's Programming Collective Intelligence. The output is a binary tree with a "distance" value at each node that tells you how far apart the two child nodes are.
I can then display this as a dendrogram, and it makes it fairly easy for a human to spot which values are grouped together. However, I'm having difficulty coming up with an algorithm that automatically decides what the groups should be. I'd like to be able to determine automatically:
The number of groups
Which points should be placed in each group
Is there a standard algorithm for this?
I think there is no default way to do this. Simple 'manual' methods (both illustrated in the sketch after this list) would be to either:
specify the number of clusters you want/expect
set a threshold for the maximum distance between two nodes; any nodes with a larger distance belong to another cluster
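Both manual approaches can be applied directly with SciPy's fcluster, assuming the tree is available as a SciPy linkage matrix; the cut values below are arbitrary example numbers:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
Z = linkage(X, method="average")

labels_by_k = fcluster(Z, t=5, criterion="maxclust")       # ask for at most 5 groups
labels_by_dist = fcluster(Z, t=1.5, criterion="distance")  # cut where merge distance exceeds 1.5
```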
There are some automatic methods to determine the number of clusters. R has the Dynamic Tree Cut package, which automatically deals with this problem; pvclust could also be used. Two more methods for dealing with this problem are described in Salvador (2002) and Daniels (2006).
I have found out that the Calinski-Harabasz index (also known as Variance Ratio Criterion) works well with dendrograms produced by hierarchical clustering. You can find more information (and a comparative study) in this paper.
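For reference, scikit-learn exposes this index directly; a minimal sketch of scoring several candidate cuts of a hierarchy and picking the best one (toy data, arbitrary range of k):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
Z = linkage(X, method="ward")

scores = {}
for k in range(2, 8):
    labels = fcluster(Z, t=k, criterion="maxclust")
    if len(set(labels)) > 1:               # the score needs at least 2 distinct groups
        scores[k] = calinski_harabasz_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```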
