What is the general convention for the number of clusters k when performing k-means on the KDD99 dataset? Three different papers I read use three completely different values of k (25, 20 and 5). I would like to know the general opinion on this, e.g. what the range of k should be, etc.
Thanks
The K-means clustering algorithm is used to find groups which have not been explicitly labeled in the data.
In general there is no method for determining the exact value of K, but an estimated value can be obtained.
To find K, compute the mean distance between data points and their cluster centroid for increasing values of K; the improvement drops off sharply once K passes the natural number of clusters.
The elbow method and the kernel method work more precisely and are the recommended options, but the number of clusters can still depend on your problem.
One quick approach is to take the square root of the number of data points divided by two and use that as the number of clusters, i.e. K ≈ sqrt(n/2).
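As a rough illustration (not taken from the papers above), here is a sketch of the sqrt(n/2) rule of thumb and an elbow scan with scikit-learn; the matrix X, the range of k, and the random data are placeholders for your own preprocessed features:

    import numpy as np
    from sklearn.cluster import KMeans

    # X: your (n_samples, n_features) feature matrix, e.g. preprocessed KDD99 features
    X = np.random.rand(500, 10)  # placeholder data

    # Rule of thumb: k ~ sqrt(n / 2)
    print("rule-of-thumb k:", int(np.sqrt(X.shape[0] / 2)))

    # Elbow method: track inertia (within-cluster sum of squared distances) as k grows
    ks = range(2, 31)
    inertias = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(km.inertia_)

    # Look for the k where the decrease in inertia levels off (the "elbow")
    for k, inertia in zip(ks, inertias):
        print(k, inertia)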
Related
Traditional unsupervised learning approaches normally need the number of clusters (K) to be specified before computing. But what if I do not know the exact value of K and want to leave it out of the algorithm? In other words, is there any unsupervised learning algorithm that does not need K to be assigned, so that the clustering is found automatically?
Affinity propagation
DBSCAN
Mean shift
For more details, check scikit-learn docs here.
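As a quick illustration with scikit-learn (a sketch on toy data; the eps/min_samples values are just examples you would tune for your own data), none of these estimators take a number of clusters as input:

    from sklearn.cluster import AffinityPropagation, DBSCAN, MeanShift
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    for model in (AffinityPropagation(random_state=0),
                  DBSCAN(eps=0.8, min_samples=5),
                  MeanShift()):
        labels = model.fit_predict(X)
        # DBSCAN labels noise points as -1, so exclude that "cluster"
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print(type(model).__name__, "found", n_clusters, "clusters")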
You could try to infer the number of clusters with metrics such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC), or by using the silhouette score or the elbow method. I've also heard people talk about automatic clustering methods based on self-organizing maps (SOM), but you'd have to do your own research there.
In my experience it usually just boils down to exploring the data with manifold methods such as t-SNE and/or density-based methods such as DBSCAN, and then setting k either manually or with a suitable heuristic.
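If you do end up choosing k with a heuristic, a silhouette scan is one simple option. Here is a sketch on toy data (the range of k is arbitrary):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=400, centers=3, random_state=0)

    # Pick the k with the highest mean silhouette coefficient
    scores = {}
    for k in range(2, 11):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)

    best_k = max(scores, key=scores.get)
    print("best k by silhouette:", best_k)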
There is also hierarchical clustering, which comes from graph theory. You can cluster either bottom-up or top-down.
Bottom up
define distance metric (Euclidean, Manhattan...)
start with each point in its own cluster
merge closest two clusters
There are three ways to select the closest clusters:
complete link -> two clusters with the smallest maximum pairwise distance
single link -> two clusters with the smallest minimum pairwise distance
average link -> two clusters with the smallest average pairwise distance
Single-linkage clustering can be solved with Kruskal's minimum spanning tree algorithm; however, while easy to understand, this runs in O(n^3). There is a variation of Prim's MST algorithm which can solve it in O(n^2).
Top-down aka Divisive Analysis
Start with all points in the same cluster and divide clusters at each iteration.
There are other clustering algorithms which you can google, some already mentioned in other answers. I have not used the others, so I will leave them out.
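For completeness, here is a minimal sketch of the bottom-up (agglomerative) approach described above using SciPy; the random data and the cut into 5 clusters are placeholders:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(50, 4)  # placeholder data: 50 points in 4 dimensions

    # Bottom-up: each point starts in its own cluster, the closest clusters are merged
    for method in ("single", "complete", "average"):
        Z = linkage(X, method=method, metric="euclidean")
        labels = fcluster(Z, t=5, criterion="maxclust")  # cut the tree into 5 clusters
        print(method, "->", len(set(labels)), "clusters")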
I have read some resources and I found out how hierarchical clustering works. However, when I compare it with k-means clustering, it seems to me that k-means really produces a specific number of clusters, whereas hierarchical analysis shows me how the samples can be clustered. What I mean is that I do not get a specific number of clusters in hierarchical clustering. I only get a scheme of how the clusters can be constituted and a picture of the relations between the samples.
Thus, I cannot understand where I can use this clustering method.
Hierarchical clustering (HC) is just another distance-based clustering method like k-means. The number of clusters can be roughly determined by cutting the dendrogram produced by HC. Determining the number of clusters in a data set is not an easy task for any clustering method; it usually depends on your application. Tuning the thresholds in HC may be more explicit and straightforward for researchers, especially for a very large data set. I think this question is also related.
In k-means clustering, k is a hyperparameter that you need to find in order to divide your data points into clusters, whereas in hierarchical clustering (let's take one type of hierarchical clustering, i.e. agglomerative) you first consider every point in your dataset as its own cluster, then merge the two most similar clusters based on a similarity metric, and repeat this until you get a single cluster. I will explain this with an example.
Suppose initially you have 13 points (x_1, x_2, ..., x_13) in your dataset, so at the start you have 13 clusters. Now in the second step, say you get 7 clusters (x_1-x_2, x_4-x_5, x_6-x_8, x_3-x_7, x_11-x_12, x_10, x_13) based on the similarity between the points. In the third step, say you get 4 clusters (x_1-x_2-x_4-x_5, x_6-x_8-x_10, x_3-x_7-x_13, x_11-x_12). Continuing like this, you arrive at a step where all the points in your dataset form one cluster, which is also the last step of the agglomerative clustering algorithm.
So in hierarchical clustering there is no k hyperparameter to fix in advance; depending on your problem, if you want 7 clusters you stop at the second step, if you want 4 clusters you stop at the third step, and so on.
A practical advantage of hierarchical clustering is the possibility of visualizing results using a dendrogram. If you don't know in advance what number of clusters you're looking for (as is often the case...), the dendrogram plot can help you choose k with no need to create separate clusterings. A dendrogram can also give great insight into the data structure, help identify outliers, etc. Hierarchical clustering is also deterministic, whereas k-means with random initialization can give you different results when run several times on the same data.
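As an illustration (a sketch on placeholder data, not tied to any particular dataset), SciPy lets you build the linkage once, inspect the dendrogram, and then cut it into whatever number of clusters looks reasonable without re-clustering:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    X = np.random.rand(30, 2)          # placeholder data
    Z = linkage(X, method="average")   # the tree is computed once and is deterministic

    dendrogram(Z)                      # inspect the tree to choose a sensible cut
    plt.show()

    labels_k4 = fcluster(Z, t=4, criterion="maxclust")  # same tree, cut into 4 clusters
    labels_k7 = fcluster(Z, t=7, criterion="maxclust")  # ... or into 7, no re-clustering needed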
Hope this helps.
I am running a k-means algorithm in R and trying to find the optimal number of clusters, k. Using the silhouette method, the gap statistic, and the elbow method, I determined that the optimal number of clusters is 2. While there are no predefined clusters for the business, I am concerned that k=2 is not very insightful, which leads me to a few questions.
1) What does an optimal k = 2 mean in terms of the data's natural clustering? Does this suggest that maybe there are no clear clusters, or that having no clusters at all is better than any clustering?
2) At k = 2, the R-squared is low (.1). At k = 5, the R-squared is much better (.32). What are the exact trade-offs of selecting k = 5 knowing it's not optimal? Would it be that you can increase the number of clusters, but they may not be distinct enough?
3) My n=1000, I have 100 variables to choose from, but only selected 5 from domain knowledge. Would increasing the number of variables necessarily make the clustering better?
4) As a follow up to question 3, if a variable is introduced and lowers the R-squared, what does that say about the variable?
I am no expert but I will try to answer as best as I can:
1) Your optimal-cluster-number methods gave you k=2, so that would suggest there is clear clustering; the number of clusters is just low (2). To help with this, use your knowledge of the domain to aid the interpretation: do 2 clusters make sense given your domain?
2) Yes, you're correct. The optimal solution in terms of R-squared is to have as many clusters as data points, but this isn't optimal in terms of why you're doing k-means. You're doing k-means to gain more insightful information from the data; this is your primary goal. As such, if you choose k=5 your data will fit your 5 clusters better, but as you say there probably isn't much distinction between them, so you're not gaining any insight.
3) Not necessarily; in fact, adding variables blindly could make it worse. K-means operates in Euclidean space, so every variable is given an even weighting in determining the clusters. If you add variables that are not relevant, their values will still distort the n-dimensional space, making your clusters worse.
4) (Double-check my logic here, I'm not 100% sure on this one.) If a variable is introduced with the same number of clusters and it drops the R-squared, then yes, it is a useful variable to add; it means it is correlated with your other variables.
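To make the trade-off in (2) concrete, here is a sketch of how an R-squared style ratio (between-cluster sum of squares over total sum of squares) can be computed for a few values of k with scikit-learn; the random matrix stands in for your 1000 x 5 data:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(1000, 5)  # placeholder for your 1000 x 5 feature matrix

    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    for k in (2, 5):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        within_ss = km.inertia_                  # within-cluster sum of squares
        r_squared = 1 - within_ss / total_ss     # equals between_SS / total_SS
        print(f"k={k}: R-squared = {r_squared:.2f}")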
I have run modularity edge_weight/randomized at a resolution of 1, at least 20 times on the same network. This is the same network I have created based on the following rule: two nodes are related if they have at least one item in common. Every time I run modularity I get a slightly different node distribution among communities. Additionally, I get 9 or 10 communities, but it is never consistent. Any comment or help is much appreciated.
I found a solution to my problem using consensus clustering. Here is the paper that describes it. One way to get optimal clusters, without having to solve them in a high-dimensional space using spectral clustering, is to run the algorithm repeatedly and combine the resulting partitions until no further change in the partition can be achieved. Here is the article with the complete explanation and details:
Andrea Lancichinetti & Santo Fortunato, "Consensus clustering in complex networks", Scientific Reports 2:336 (2012), DOI: 10.1038/srep00336
The consensus matrix. Let us suppose that we wish to combine nP partitions found by a clustering algorithm on a network with n vertices. The consensus matrix D is an n x n matrix, whose entry Dij indicates the number of partitions in which vertices i and j of the network were assigned to the same cluster, divided by the number of partitions nP. The matrix D is usually much denser than the adjacency matrix A of the original network, because in the consensus matrix there is an edge between any two vertices which have co-occurred in the same cluster at least once. On the other hand, the weights are large only for those vertices which are most frequently co-clustered, whereas low weights indicate that the vertices are probably at the boundary between different (real) clusters, so their classification in the same cluster is unlikely and essentially due to noise. We wish to maintain the large weights and to drop the low ones, therefore a filtering procedure is in order. Among other things, in the absence of filtering the consensus matrix would quickly grow into a very dense matrix, which would make the application of any clustering algorithm computationally expensive. We discard all entries of D below a threshold t. We stress that there might be some noisy vertices whose edges could all be below the threshold, and they would not be connected anymore. When this happens, we just connect them to their neighbors with the highest weights, to keep the graph connected all along the procedure.
Next we apply the same clustering algorithm to D and produce another set of partitions, which is then used to construct a new consensus matrix D′, as described above. The procedure is iterated until the consensus matrix turns into a block-diagonal matrix D_final, whose weights equal 1 for vertices in the same block and 0 for vertices in different blocks. The matrix D_final delivers the community structure of the original network. In our calculations typically one iteration is sufficient to lead to stable results. We remark that in order to use the same clustering method all along, the latter has to be able to detect clusters in weighted networks, since the consensus matrix is weighted. This is a necessary constraint on the choice of the methods for which one could use the procedure proposed here. However, it is not a severe limitation, as most clustering algorithms in the literature can handle weighted networks or can be trivially extended to deal with them.
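Here is a rough sketch of the consensus-matrix idea in Python, using NetworkX's Louvain implementation (louvain_communities, available in NetworkX 2.8+) as a stand-in for whatever modularity algorithm Gephi runs. The example graph, number of runs, and threshold are arbitrary, and the paper's step of reconnecting isolated noisy vertices and iterating to convergence is omitted for brevity:

    import numpy as np
    import networkx as nx

    G = nx.karate_club_graph()          # placeholder network
    nodes = list(G.nodes())
    n, n_runs, threshold = len(nodes), 20, 0.5

    # Consensus matrix D: D[i, j] = fraction of runs in which i and j share a community
    D = np.zeros((n, n))
    for seed in range(n_runs):
        partition = nx.community.louvain_communities(G, seed=seed)
        for community in partition:
            members = [nodes.index(v) for v in community]
            for i in members:
                for j in members:
                    D[i, j] += 1.0 / n_runs

    D[D < threshold] = 0.0              # drop weak (noisy) co-assignments
    np.fill_diagonal(D, 0.0)            # no self-loops in the consensus graph

    # Re-cluster the thresholded consensus graph; the paper iterates until stable
    G_consensus = nx.from_numpy_array(D)
    consensus_partition = nx.community.louvain_communities(G_consensus, weight="weight", seed=0)
    print(len(consensus_partition), "communities")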
I think that the answer lies in the randomization part of the algorithm. You can find more details here:
https://github.com/gephi/gephi/wiki/Modularity
https://sites.google.com/site/findcommunities/
http://lanl.arxiv.org/abs/0803.0476
Looking at the Histogram Documentation, there are four (five, counting the synonym) different comparison methods:
CV_COMP_CORREL Correlation
CV_COMP_CHISQR Chi-Square
CV_COMP_INTERSECT Intersection
CV_COMP_BHATTACHARYYA Bhattacharyya distance
CV_COMP_HELLINGER Synonym for CV_COMP_BHATTACHARYYA
They all give different outputs that are read differently, as shown in the Compare Histogram Documentation. But I can't find anything that states how well each method performs compared against the others. Surely there are pros and cons for each method, otherwise why have multiple methods?
Even the OpenCV 2 Computer Vision Application Programming Cookbook has very little to say on the differences:
The call to cv::compareHist is straightforward. You just input the two
histograms and the function returns the measured distance. The
specific measurement method you want to use is specified using a flag.
In the ImageComparator class, the intersection method is used (with
flag CV_COMP_INTERSECT). This method simply compares, for each bin,
the two values in each histogram, and keeps the minimum one. The
similarity measure is then simply the sum of these minimum values.
Consequently, two images having histograms with no colors in common
would get an intersection value of 0, while two identical histograms
would get a value equal to the total number of pixels.
The other methods available are the Chi-Square (flag CV_COMP_CHISQR)
which sums the normalized square difference between the bins, the
correlation method (flag CV_COMP_CORREL) which is based on the
normalized cross-correlation operator used in signal processing to
measure the similarity between two signals, and the Bhattacharyya
measure (flag CV_COMP_BHATTACHARYYA) used in statistics to estimate
the similarity between two probabilistic distributions.
There must be differences between the methods, so my question is: what are they, and under what circumstances does each work best?
CV_COMP_INTERSECT is fast to compute since you just need the minimum value for each bin, but it will not tell you much about the distribution of the differences. The other methods try to produce a better and more continuous matching score, under different assumptions about the pixel distribution.
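For a quick side-by-side look at the scores, here is a sketch using the Python bindings; cv2.HISTCMP_* are the newer names for the CV_COMP_* flags, and the image paths and bin counts are placeholders:

    import cv2

    img1 = cv2.imread("image1.jpg")   # placeholder paths
    img2 = cv2.imread("image2.jpg")

    def hsv_hist(img):
        # 2D hue/saturation histogram, normalized so image size doesn't dominate the score
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        return cv2.normalize(hist, hist).flatten()

    h1, h2 = hsv_hist(img1), hsv_hist(img2)

    methods = {
        "Correlation": cv2.HISTCMP_CORREL,           # higher = more similar (1.0 = identical)
        "Chi-Square": cv2.HISTCMP_CHISQR,            # lower = more similar (0.0 = identical)
        "Intersection": cv2.HISTCMP_INTERSECT,       # higher = more similar
        "Bhattacharyya": cv2.HISTCMP_BHATTACHARYYA,  # lower = more similar (0.0 = identical)
    }
    for name, flag in methods.items():
        print(name, cv2.compareHist(h1, h2, flag))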
You can find the formulae used in different methods, at
http://docs.opencv.org/doc/tutorials/imgproc/histograms/histogram_comparison/histogram_comparison.html
Some references to more details on the matching algorithms can be found at:
http://siri.lmao.sk/fiit/DSO/Prednasky/7%20a%20Histogram%20based%20methods/7%20a%20Histogram%20based%20methods.pdf