I have run modularity (edge_weight/randomized) at a resolution of 1, at least 20 times on the same network. I created this network using the following rule: two nodes are related if they have at least one item in common. Every time I run modularity I get a slightly different distribution of nodes among communities. Additionally, I get 9 or 10 communities, but the number is never consistent. Any comment or help is much appreciated.
I found a solution to my problem using consensus clustering. One way to get the optimum clusters without having to solve them in a high-dimensional space using spectral clustering would be to run the algorithm repeatedly until no more partitions can be achieved. Here is the paper that describes it, with the relevant explanation:
SCIENTIFIC REPORTS | 2 : 336 | DOI: 10.1038/srep00336
Consensus clustering in complex networks Andrea Lancichinetti & Santo Fortunato
The consensus matrix. Let us suppose that we wish to combine nP partitions found by a clustering algorithm on a network with n vertices. The consensus matrix D is an n x n matrix, whose entry Dij indicates the number of partitions in which vertices i and j of the network were assigned to the same cluster, divided by the number of partitions nP. The matrix D is usually much denser than the adjacency matrix A of the original network, because in the consensus matrix there is an edge between any two vertices which have cooccurred in the same cluster at least once. On the other hand, the weights are large only for those vertices which are most frequently coclustered, whereas low weights indicate that the vertices are probably at the boundary between different (real) clusters, so their classification in the same cluster is unlikely and essentially due to noise. We wish to maintain the large weights and to drop the low ones, therefore a filtering procedure is in order. Among the other things, in the absence of filtering the consensus matrix would quickly grow into a very dense matrix, which would make the application of any clustering algorithm computationally expensive. We discard all entries of D below a threshold t. We stress that there might be some noisy vertices whose edges could all be below the threshold, and they would be not connected anymore. When this happens, we just connect them to their neighbors with highest weights, to keep the graph connected all along the procedure.
Next we apply the same clustering algorithm to D and produce another set of partitions, which is then used to construct a new consensus matrix D', as described above. The procedure is iterated until the consensus matrix turns into a block diagonal matrix Dfinal, whose weights equal 1 for vertices in the same block and 0 for vertices in different blocks. The matrix Dfinal delivers the community structure of the original network. In our calculations typically one iteration is sufficient to lead to stable results. We remark that in order to use the same clustering method all along, the latter has to be able to detect clusters in weighted networks, since the consensus matrix is weighted. This is a necessary constraint on the choice of the methods for which one could use the procedure proposed here. However, it is not a severe limitation, as most clustering algorithms in the literature can handle weighted networks or can be trivially extended to deal with them.
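To make the consensus matrix concrete, here is a minimal sketch in plain Python. It only covers building D from a set of partitions and discarding entries below t; the reconnection of isolated vertices and the re-clustering loop from the paper are left out. The three partitions and the threshold 0.5 are made-up values for illustration, not results from any real network.

```python
from itertools import combinations

def consensus_matrix(partitions):
    """D[i][j] = fraction of partitions in which vertices i and j share a cluster."""
    n = len(partitions[0])
    n_p = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    # Diagonal entries are left at 0; only pairs of distinct vertices matter here.
    return [[c / n_p for c in row] for row in counts]

def filter_matrix(D, t):
    """Discard (set to 0) every entry of D below the threshold t."""
    return [[w if w >= t else 0.0 for w in row] for row in D]

# Hypothetical output of three runs of a randomized community detection
# algorithm on a five-vertex network (each list holds community ids).
partitions = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]
D = consensus_matrix(partitions)
D_filtered = filter_matrix(D, t=0.5)
```

Vertex 2 sits at the boundary: it is co-clustered with vertex 0 in two of three runs (weight 2/3, kept) and with vertex 3 in one run (weight 1/3, dropped).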
I think that the answer is in the randomizing part of the algorithm. You can find more details here:
https://github.com/gephi/gephi/wiki/Modularity
https://sites.google.com/site/findcommunities/
http://lanl.arxiv.org/abs/0803.0476
The scikit-learn implementation of K-means has a predict() function which can be applied to unseen data, whereas DBSCAN and agglomerative clustering do not have a predict() function.
All three algorithms have fit_predict(), which fits the model and then predicts. But k-means also has predict(), which can be used directly on unseen data; this is not the case for the other algorithms.
I am well aware that these are clustering algorithms, and in my opinion predict() should not be there for K-means either.
What is the intuition or reason behind this discrepancy? Is it only because k-means performs "1NN classification" that it has a predict() function?
My interpretation is that the difference comes from the way the clusters are computed. In KMeans there is a native way to assign a new point to a cluster, while there is none in DBSCAN or agglomerative clustering.
A) KMeans
In KMeans, during the construction of the clusters, a data point is assigned to the cluster with the closest centroid, and the centroids are updated afterwards. "Predicting" in the KMeans algorithm is actually doing the assignment step without updating the clusters.
If you assume that the new data points are drawn from the same distribution as the "training" set, and that your "training" set was representative enough, it is reasonable to think that one can assign the new data points following the heuristic of the algorithm without updating the cluster centroids, thus making predictions.
Of course, if the distribution of the data points is likely to change, one should rerun the KMeans clustering on the updated dataset.
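As a rough illustration of "the assignment step without updating the clusters", here is a sketch in plain Python. It is not scikit-learn's code; the centroids are hypothetical values standing in for the result of an earlier fit, and `assign_to_nearest` is a name made up for this example.

```python
def assign_to_nearest(points, centroids):
    """Assignment step only: label each point with the index of its closest centroid."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [
        min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        for p in points
    ]

# Hypothetical centroids, as if obtained from a previous k-means run.
centroids = [(0.0, 0.0), (10.0, 10.0)]

# New, unseen points are labelled without moving the centroids.
new_points = [(1.0, -1.0), (9.0, 12.0), (4.0, 4.0)]
labels = assign_to_nearest(new_points, centroids)
```

The centroids are read but never modified, which is why this kind of "prediction" is cheap and well defined for k-means.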
B) DBSCAN
DBSCAN creates the cluster by finding high density areas of the dataset (parametrized by the parameters epsilon and min_points). This is done by computing point-level properties (whether the point is a core point, a directly reachable point, a reachable point or a noise point). Adding a new data point can modify the definition of the neighboring points, and thus make the computed clusters obsolete.
As an example, let's look at this illustration from wikipedia, copied below. On this image there is one cluster (red+yellow points) and one noise point (blue). Red points are core points and yellow points are reachable points.
and consider two cases:
Adding a new point halfway between A and N would make N reachable from A, so N would belong to the cluster.
Adding (min_points-1) new points in the epsilon-neighborhood of N, but in no other epsilon-neighborhood (as an example at the top of the picture), would change the status of N, which would become a core point and form a new cluster with the newly added points.
Here, adding new data points clearly requires recomputing the clusters.
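The first case can be reproduced with a toy 1-D version of DBSCAN. This is a simplified sketch written for this example, not scikit-learn's implementation; the data, eps=1.1 and min_points=3 are made-up values, and a point counts itself among its neighbours.

```python
def dbscan_1d(points, eps, min_points):
    """Tiny DBSCAN for 1-D data. Returns one label per point, -1 meaning noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if abs(points[i] - points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_points for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster from this unvisited core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:  # only core points keep expanding the cluster
                        stack.append(q)
        cluster += 1
    return labels

data = [0.0, 1.0, 2.0, 3.0, 5.0]   # the point at 5.0 plays the role of N
before = dbscan_1d(data, eps=1.1, min_points=3)

# One extra point between the cluster and N changes the status of N.
after = dbscan_1d(data + [4.0], eps=1.1, min_points=3)
```

Before the insertion the point at 5.0 is noise; afterwards the new point at 4.0 is a core point and pulls 5.0 into the existing cluster, so the old labels are no longer valid.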
C) Agglomerative clustering
Agglomerative clustering iteratively builds the clusters starting from individual points and merges them according to a linkage measure. Similarly to DBSCAN, adding new data points can entirely modify the final clusters because it can trigger different merges.
As an example, if the linkage strategy you choose in sklearn is "single", clusters are merged if the minimum distance between all elements of the two clusters is below a chosen threshold. You can easily see that a single well-placed new data point can trigger a merge between two clusters that would otherwise have stayed separate.
Thus predicting here also requires recomputing the clusters.
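The single-linkage case is easy to check by hand in one dimension, where cutting the hierarchy at a distance threshold amounts to splitting the sorted values wherever the gap between neighbours exceeds the threshold. The sketch below uses that shortcut; the data and the threshold of 5 are made up for illustration.

```python
def single_linkage_1d(values, threshold):
    """Single-linkage clusters of 1-D data, merged while the gap is <= threshold."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev <= threshold:
            clusters[-1].append(cur)   # close enough: same cluster
        else:
            clusters.append([cur])     # gap too large: start a new cluster
    return clusters

data = [0, 1, 2, 10, 11, 12]
before = single_linkage_1d(data, threshold=5)

# A single point placed in the gap bridges the two groups.
after = single_linkage_1d(data + [6], threshold=5)
```

With the extra point at 6, both gaps (2 to 6 and 6 to 10) fall under the threshold and the two clusters collapse into one.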
I have an undirected weighted graph. Let's say node A and node B don't have a direct link between them, but there are paths connecting both nodes through other intermediate nodes. Now I want to predict the possible weight of the direct link between nodes A and B, as well as its probability.
I can predict the weight by finding the possible paths and their average weight, but how can I find the probability of the link?
The problem you are describing is called link prediction. Here is a short tutorial explaining the problem and some simple heuristics that can be used to solve it.
Since this is an open-ended problem, these simple solutions can be improved a lot by using more complicated techniques. Another approach for predicting the probability of an edge is to use machine learning rather than rule-based heuristics.
A recent article, node2vec, proposed an algorithm that maps each node in a graph to a dense vector (aka an embedding). Then, by applying some binary operator to a pair of node vectors, we get an edge representation (another vector). This vector is then used as the input features to some classifier that predicts the edge probability. The paper compared a few such binary operators over a few different datasets, and the approach significantly outperformed the heuristic benchmark scores across all of these datasets.
The code to compute embeddings given your graph can be found here.
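For the "binary operator" step, here is a small sketch of the four operators compared in the node2vec paper (average, Hadamard, weighted-L1, weighted-L2), written in plain Python. The embeddings below are made-up numbers, not output of node2vec, and the classifier that would consume these features is not shown.

```python
def edge_features(u, v, operator="hadamard"):
    """Combine two node embeddings into one edge feature vector."""
    if operator == "average":
        return [(a + b) / 2 for a, b in zip(u, v)]
    if operator == "hadamard":          # element-wise product
        return [a * b for a, b in zip(u, v)]
    if operator == "weighted_l1":       # element-wise absolute difference
        return [abs(a - b) for a, b in zip(u, v)]
    if operator == "weighted_l2":       # element-wise squared difference
        return [(a - b) ** 2 for a, b in zip(u, v)]
    raise ValueError(f"unknown operator: {operator}")

# Hypothetical 3-dimensional embeddings for nodes A and B.
embeddings = {
    "A": [1.0, 2.0, -1.0],
    "B": [3.0, 4.0, 0.5],
}
features_ab = edge_features(embeddings["A"], embeddings["B"], "hadamard")
```

A binary classifier trained on such vectors for known edges and sampled non-edges can then output a probability for the missing A-B link.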
Traditional unsupervised learning approaches normally require the number of clusters (K) to be specified before computing. But what if I do not know the exact value of K and want to leave it out of the algorithm? That is, is there any unsupervised learning algorithm that does not need K to be assigned, so that we get the K clusters automatically?
Affinity propagation
DBSCAN
Mean shift
For more details, check scikit-learn docs here.
You could try to infer the number of clusters with metrics such as the Akaike information criterion or the Bayesian information criterion, or by using the silhouette score or the elbow method. I've also heard people talk about automatic clustering methods based on self-organizing maps (SOM), but you'd have to do your own research there.
In my experience it usually just boils down to exploring the data with manifold methods such as t-SNE and/or density based methods such as DBSCAN and then setting k either manually or with a suitable heuristic.
There is hierarchical clustering in graph theory. You can achieve clustering either bottom-up or top-down.
Bottom up
define distance metric (Euclidean, Manhattan...)
start with each point in its own cluster
merge closest two clusters
There are three common ways to select the closest clusters:
complete link -> two clusters with the smallest maximum pairwise distance
single link -> two clusters with the smallest minimum pairwise distance
average link -> two clusters with the smallest average pairwise distance
Single-linkage clustering can be solved with Kruskal's minimum spanning tree algorithm; however, while easy to understand, a naive implementation runs in O(n^3). There is a variation of Prim's algorithm for the MST which can solve this in O(n^2).
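The bottom-up steps above can be sketched as a deliberately naive loop in plain Python. It makes no attempt at efficiency, uses Euclidean distance, and stops once a requested number of clusters is left; that stopping rule and the sample points are assumptions made for this example.

```python
from itertools import combinations

def agglomerative(points, n_clusters, linkage="single"):
    """Naive bottom-up clustering: repeatedly merge the two closest clusters."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def cluster_distance(c1, c2):
        pairwise = [dist(p, q) for p in c1 for q in c2]
        if linkage == "single":
            return min(pairwise)
        if linkage == "complete":
            return max(pairwise)
        if linkage == "average":
            return sum(pairwise) / len(pairwise)
        raise ValueError(f"unknown linkage: {linkage}")

    clusters = [[p] for p in points]        # start with each point on its own
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance (i < j).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
result = agglomerative(points, n_clusters=2, linkage="single")
```

Switching `linkage` to "complete" or "average" only changes how the distance between two clusters is summarized; the merge loop stays the same.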
Top-down aka Divisive Analysis
Start with all points in the same cluster and divide clusters at each iteration.
There are other clustering algorithms which you may look up, some already mentioned in other answers. I have not used the others, so I will leave them out.
I am reading about the difference between k-means clustering and k-medoid clustering.
Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the more familiar sum of squared Euclidean distance-type metric to evaluate variance that we find with k-means. And apparently this different distance metric somehow reduces noise and outliers.
I have seen this claim but I have yet to see any good reasoning as to the mathematics behind this claim.
What makes the pairwise distance measure commonly used in k-medoid better? More exactly, how does the lack of a squared term allow k-medoids to have the desirable properties associated with the concept of taking a median?
1. K-medoid is more flexible
First of all, you can use k-medoids with any similarity measure. K-means however, may fail to converge - it really must only be used with distances that are consistent with the mean. So e.g. Absolute Pearson Correlation must not be used with k-means, but it works well with k-medoids.
2. Robustness of medoid
Secondly, the medoid as used by k-medoids is roughly comparable to the median (in fact, there also is k-medians, which is like K-means but for Manhattan distance). If you look up literature on the median, you will see plenty of explanations and examples why the median is more robust to outliers than the arithmetic mean. Essentially, these explanations and examples will also hold for the medoid. It is a more robust estimate of a representative point than the mean as used in k-means.
Consider this 1-dimensional example:
[1, 2, 3, 4, 100000]
Both the median and medoid of this set are 3. The mean is 20002.
Which do you think is more representative of the data set? The mean has the lower squared error, but assuming that there might be a measurement error in this data set ...
Technically, the notion of breakdown point is used in statistics. The median has a breakdown point of 50% (i.e. half of the data points can be incorrect, and the result is still unaffected), whereas the mean has a breakdown point of 0 (i.e. a single large observation can yield a bad estimate).
I do not have a proof, but I assume the medoid will have a breakdown point similar to that of the median.
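The numbers in the 1-dimensional example can be checked with a few lines of Python. The `medoid` helper is written for this example (it uses absolute difference as the dissimilarity) and is not taken from any library.

```python
from statistics import mean, median

def medoid(values):
    """The member of `values` with the smallest total distance to all others."""
    return min(values, key=lambda c: sum(abs(c - x) for x in values))

data = [1, 2, 3, 4, 100000]

data_mean = mean(data)
data_median = median(data)
data_medoid = medoid(data)

# Making the outlier ten times larger moves the mean but not the other two.
worse = [1, 2, 3, 4, 1000000]
worse_mean = mean(worse)
worse_median = median(worse)
worse_medoid = medoid(worse)
```

The mean follows the outlier wherever it goes, while the median and the medoid stay at 3 in both data sets.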
3. k-medoids is much more expensive
That's the main drawback. Usually, PAM takes much longer to run than k-means. As it involves computing all pairwise distances, it is O(n^2*k*i); whereas k-means runs in O(n*k*i) where usually, k times the number of iterations is k*i << n.
I think this has to do with the selection of the center for the cluster. k-means will select the "center" of the cluster, while k-medoid will select the "most centered" member of the cluster.
In a cluster with outliers (i.e. points far away from the other members of the cluster) k-means will place the center of the cluster towards the outliers, whereas k-medoid will select one of the more clustered members (the medoid) as the center.
It now depends on what you use clustering for. If you just wanted to classify a bunch of objects then you don't really care about where the center is; but if the clustering was used to train a decider which will now classify new objects based on those center points, then k-medoid will give you a center closer to where a human would place the center.
In wikipedia's words:
"It [k-medoid] is more robust to noise and outliers as compared to k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances."
Here's an example:
Suppose you want to cluster on one dimension with k=2. One cluster has most of its members around 1000 and the other around -1000; but there is an outlier (or noise) at 100000.
It obviously belongs to the cluster around 1000, but k-means will put the center point away from 1000 and towards 100000. This may even cause some of the members of the 1000 cluster (say a member with value 500) to be assigned to the -1000 cluster.
k-medoid will select one of the members around 1000 as the medoid. It'll probably select one that is bigger than 1000, but it will not select an outlier.
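Here is a small numeric version of that example in plain Python. It only computes the two kinds of centers for fixed groups and then performs one assignment; it is not a full k-means or k-medoids run, and the member values are made up.

```python
def medoid(values):
    """The member of `values` with the smallest total distance to all others."""
    return min(values, key=lambda c: sum(abs(c - x) for x in values))

def nearest(value, centers):
    """Return the center closest to `value`."""
    return min(centers, key=lambda c: abs(value - c))

left = [-1100, -1000, -900]
right = [500, 900, 1000, 1100, 100000]   # 100000 is the outlier

mean_centers = (sum(left) / len(left), sum(right) / len(right))
medoid_centers = (medoid(left), medoid(right))

# Where does the member with value 500 go in each case?
with_means = nearest(500, mean_centers)
with_medoids = nearest(500, medoid_centers)
```

With means, the right-hand center is dragged to 20700, so 500 ends up closer to the -1000 center; with medoids the right-hand center stays at 1000 and 500 keeps its cluster.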
Just a tiny note to add to @Eli's answer: k-medoid is more robust to noise and outliers than k-means because the latter selects the cluster center, which is mostly just a "virtual point", whereas the former chooses an "actual object" from the cluster.
Suppose you have five 2D points in one cluster with the coordinates (1,1), (1,2), (2,1), (2,2), and (100,100). If we don't consider the exchange of objects among clusters, with k-means you will get the cluster center (21.2,21.2), which is pulled strongly towards the point (100,100). However, k-medoid will choose the center among (1,1), (1,2), (2,1), and (2,2) according to its algorithm.
Here is a fun applet (E.M. Mirkes, K-means and K-medoids applet. University of Leicester, 2011) in which you can randomly generate a dataset in the 2D plane and compare the learning processes of k-medoid and k-means.
If the data to cluster are literally points (either 2D (x, y) or 3D (x, y,z)), it would be quite intuitive to choose a clustering method. Because we can draw them and visualize them, we somewhat know better which clustering method is more suitable.
e.g.1 If my 2D data set is of the formation shown in the right top corner, I would know that K-means may not be a wise choice here, whereas DBSCAN seems like a better idea.
However, just as the scikit-learn website states:
While these examples give some intuition about the algorithms, this
intuition might not apply to very high dimensional data.
AFAIK, in most practical problems we don't have such simple data. Most probably we have high-dimensional tuples, which cannot be visualized this way.
e.g.2 I wish to cluster a data set where each data point is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT visualize it in a coordinate system and observe its distribution like before. So I will NOT be able to say DBSCAN is superior to K-means in this case.
So my question:
How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?
"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).
4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.
Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.
The authors of that survey also publish a software framework which has a lot of these advanced clustering methods (not just k-means, but e.g. CASH, FourC, ERiC): ELKI
There are at least two common, generic approaches:
One can use some dimensionality reduction technique in order to actually visualize the high dimensional data, there are dozens of popular solutions including (but not limited to):
PCA - principal component analysis
SOM - self-organizing maps
Sammon's mapping
Autoencoder Neural Networks
KPCA - kernel principal component analysis
Isomap
After this, one either goes back to the original space and uses techniques that seem reasonable based on observations in the reduced space, or performs clustering in the reduced space itself. The first approach uses all available information, but can be invalid due to differences induced by the reduction process. The second ensures that your observations and choices are valid (as you reduce your problem to a nice 2D/3D one), but it loses a lot of information due to the transformation used.
One tries many different algorithms and chooses the one with the best metrics (many clustering evaluation metrics have been proposed). This is a computationally expensive approach, but it has a lower bias (as reducing the dimensionality changes the information according to the transformation used).
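For the dimensionality-reduction approach, PCA is the simplest of the listed techniques to sketch. The version below uses NumPy's SVD directly rather than a library PCA class; the random 4-D data is a stand-in for a real data set and `pca_project` is a name chosen for this example.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the first principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))      # hypothetical 4-D data set
X_2d = pca_project(X, n_components=2)
```

The resulting `X_2d` can be drawn as an ordinary scatter plot to look for groups before choosing a clustering method, keeping in mind that the projection discards some of the original information.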
It is true that high-dimensional data cannot be easily visualized in a Euclidean space, but it is not true that there are no visualization techniques for it.
In addition to this claim, I will add that with just 4 features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try a multivariate data analysis taking two features at a time (so 6 times in total) to try to figure out which relations occur between the two (correlation and dependency generally). Or you can even use a 3D space for three at a time.
Then, how do you get some information from these visualizations? Well, it is not as easy as in a Euclidean space, but the point is to spot visually whether the data clusters in some groups (e.g. near some values on an axis in a parallel coordinates diagram) and to think about whether the data is somehow separable (e.g. whether it forms regions like circles, or is linearly separable in the scatter plots).
A little digression: the diagram you posted is not indicative of the power or capabilities of each algorithm given some particular data distributions; it simply highlights the nature of some algorithms. For instance, k-means is able to separate only convex and ellipsoidal areas (and keep in mind that convexity and ellipsoids exist even in N dimensions). What I mean is that there is no rule that says: given the distributions depicted in this diagram, you have to choose the correct clustering algorithm accordingly.
I suggest using a data mining toolbox that lets you explore and visualize the data (and easily transform them, since you can change their topology with transformations, projections and reductions; check the other answer by lejlot for that), like Weka (plus you do not have to implement all the algorithms yourself).
In the end I will point you to this resource for different cluster goodness and fitness measures so you can compare the results from different algorithms.
I would also suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU calculation with euclidean distance, for example.
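As a rough idea of what feature weights do to the best matching unit (BMU) search, here is a sketch with a weighted Euclidean distance in plain Python. The weights are fixed by hand purely for illustration; how a soft subspace method would actually learn them is not shown, and the units and the query point are made up.

```python
def weighted_distance(x, y, weights):
    """Euclidean distance in which each feature is scaled by its weight."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)) ** 0.5

def best_matching_unit(x, units, weights):
    """Index of the unit (prototype) closest to x under the weighted distance."""
    return min(
        range(len(units)),
        key=lambda i: weighted_distance(x, units[i], weights),
    )

units = [(0.0, 5.0), (3.0, 0.0)]   # hypothetical prototypes
x = (0.0, 0.0)

bmu_unweighted = best_matching_unit(x, units, weights=(1.0, 1.0))
# Down-weighting the second feature (assumed to be less relevant) changes the BMU.
bmu_weighted = best_matching_unit(x, units, weights=(1.0, 0.1))
```

With equal weights the second unit wins; once the second feature is treated as less relevant, the first unit, which matches x exactly on the first feature, becomes the BMU.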