What makes the distance measure in k-medoid "better" than k-means? - machine-learning

I am reading about the difference between k-means clustering and k-medoid clustering.
Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the more familiar sum of squared Euclidean distance-type metric to evaluate variance that we find with k-means. And apparently this different distance metric somehow reduces noise and outliers.
I have seen this claim but I have yet to see any good reasoning as to the mathematics behind this claim.
What makes the pairwise distance measure commonly used in k-medoid better? More exactly, how does the lack of a squared term allow k-medoids to have the desirable properties associated with the concept of taking a median?

1. K-medoid is more flexible
First of all, you can use k-medoids with any similarity measure. K-means however, may fail to converge - it really must only be used with distances that are consistent with the mean. So e.g. Absolute Pearson Correlation must not be used with k-means, but it works well with k-medoids.
2. Robustness of medoid
Secondly, the medoid as used by k-medoids is roughly comparable to the median (in fact, there also is k-medians, which is like K-means but for Manhattan distance). If you look up literature on the median, you will see plenty of explanations and examples why the median is more robust to outliers than the arithmetic mean. Essentially, these explanations and examples will also hold for the medoid. It is a more robust estimate of a representative point than the mean as used in k-means.
Consider this 1-dimensional example:
[1, 2, 3, 4, 100000]
Both the median and medoid of this set are 3. The mean is 20002.
Which do you think is more representative of the data set? The mean has the lower squared error, but assuming that there might be a measurement error in this data set ...
Technically, the notion of breakdown point is used in statistics. The median has a breakdown point of 50% (i.e. half of the data points can be incorrect, and the result is still unaffected), whereas the mean has a breakdown point of 0 (i.e. a single large observation can yield a bad estimate).
I do not have a proof, but I assume the medoid will have a similar breakdown point as the median.
3. k-medoids is much more expensive
That's the main drawback. Usually, PAM takes much longer to run than k-means. As it involves computing all pairwise distances, it is O(n^2*k*i); whereas k-means runs in O(n*k*i) where usually, k times the number of iterations is k*i << n.

I think this has to do with the selection of the center for the cluster. k-means will select the "center" of the cluster, while k-medoid will select the "most centered" member of the cluster.
In a cluster with outliers (i.e. points far away from the other members of the cluster) k-means will place the center of the cluster towards the outliers, whereas k-medoid will select one of the more clustered members (the medoid) as the center.
It now depends on what you use clustering for. If you just wanted to classify a bunch of objects then you don't really care about where the center is; but if the clustering was used to train a decider which will now classify new objects based on those center points, then k-medoid will give you a center closer to where a human would place the center.
In wikipedia's words:
"It [k-medoid] is more robust to noise and outliers as compared to k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances."
Here's an example:
Suppose you want to cluster on one dimension with k=2. One cluster has most of its members around 1000 and the other around -1000; but there is an outlier (or noise) at 100000.
It obviously belongs to the cluster around 1000 but k-means will put the center point away from 1000 and towards 100000. This may even make some of the members of the 1000 cluster (say a member with value 500) to be assigned to the -1000 cluster.
k-medoid will select one of the members around 1000 as the medoid, it'll probably select one that is bigger than 1000, but it will not select an outlier.

Just a tiny note added to #Eli's answer, K-medoid is more robust to noise and outliers than k-means because the latter selects the cluster center, which is mostly just a "virtue point", on the other hand the former chooses the "actual object" from the cluster.
Suppose you have five 2D points in one cluster with the coordinates of (1,1),(1,2),(2,1),(2,2), and (100,100). If we don't consider the object exchanges among the clusters, with k-means you will get the center of cluster (21.2,21.2) which is pretty distracted by the point (100,100). However, with k-medoid will choose the center among (1,1),(1,2),(2,1),and (2,2) according to its algorithm.
Here is a fun applet ( E.M. Mirkes, K-means and K-medoids applet. University of Leicester, 2011 ) that you can randomly generate dataset in the 2D plane and compare k-medoid and k-means learning process.

Related

How to select features for clustering?

I had time-series data, which I have aggregated into 3 weeks and transposed to features.
Now I have features: A_week1, B_week1, C_week1, A_week2, B_week2, C_week2, and so on.
Some of features are discreet, some - continuous.
I am thinking of applying K-Means or DBSCAN.
How should I approach the feature selection in such situation?
Should I normalise the features? Should I introduce some new ones, that would somehow link periods together?
Since K-means and DBSCAN are unsupervised learning algorithms, selection of features over them are tied to grid search. You may want to test them to evaluate such algorithms based on internal measures such as Davies–Bouldin index, Silhouette coefficient among others. If you're using python you can use Exhaustive Grid Search to do the search. Here is the link to the scikit library.
Formalize your problem, don't just hack some code.
K-means minimizes the sum of squares. If the features have different scales they get different influence on the optimization. Therefore, you carefully need to choose weights (scaling factors) of each variable to balance their importance the way you want (and note that a 2x scaling factor does not make the variable twice as important).
For DBSCAN, the distance is only a binary decision: close enough, or not. If you use the GDBSCAN version, this is easier to understand than with distances. But with mixed variables, I would suggest to use the maximum norm. Two objects are then close if they differ in each variable by at most "eps". You can set eps=1, and scale your variables such that 1 is a "too big" difference. For example in discrete variables, you may want to tolerate one or two discrete steps, but not three.
Logically, it's easy to see that the maximum distance threshold decomposes into a disjunction of one-variablea clauses:
maxdistance(x,y) <= eps
<=>
forall_i |x_i-y_i| <= eps

What “information” in document vectors makes sentiment prediction work?

Sentiment prediction based on document vectors works pretty well, as examples show:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
http://linanqiu.github.io/2015/10/07/word2vec-sentiment/
I wonder what pattern is in the vectors making that possible. I thought it should be similarity of vectors making that somehow possible. Gensim similarity measures rely on cosine similarity. Therefore, I tried the following:
Randomly initialised a fix “compare” vector, get cosine similarity of the “compare” vector with all other vectors in training and test set, use the similarities and the labels of the train set to estimate a logistic regression model, evaluate the model with the test set.
Looks like this, where train/test_arrays contain document vectors and train/test_labels labels either 0 or 1. (Notice, document vectors are obtained from genism doc2vec and are well trained, predicting the test set 80% right if directly used as input for the logistic regression):
fix_vec = numpy.random.rand(100,1)
def cos_distance_to_fix(x):
return scipy.spatial.distance.cosine(fix_vec, x)
train_arrays_cos = numpy.reshape(numpy.apply_along_axis(cos_distance_to_fix, axis=1, arr=train_arrays), newshape=(-1,1))
test_arrays_cos = numpy.reshape(numpy.apply_along_axis(cos_distance_to_fix, axis=1, arr=test_arrays), newshape=(-1,1))
classifier = LogisticRegression()
classifier.fit(train_arrays_cos, train_labels)
classifier.score(test_arrays_cos, test_labels)
It turns out, that this approach does not work, predicting the test set only to 50%....
So, my question is, what “information” is in the vectors, making the prediction based on vectors work, if it is not the similarity of vectors? Or is my approach simply not possible to capture similarity of vectors correct?
This is less a question about Doc2Vec than about machine-learning principles with high-dimensional data.
Your approach is collapsing 100-dimensions to a single dimension – the distance to your random point. Then, you're hoping that single dimension can still be predictive.
And roughly all LogisticRegression can do with that single-valued input is try to pick a threshold-number that, when your distance is on one side of that threshold, predicts a class – and on the other side, predicts not-that-class.
Recasting that single-threshold-distance back to the original 100-dimensional space, it's essentially trying to find a hypersphere, around your random point, that does a good job collecting all of a single class either inside or outside its volume.
What are the odds your randomly-placed center-point, plus one adjustable radius, can do that well, in a complex high-dimensional space? My hunch is: not a lot. And your results, no better than random guessing, seems to suggest the same.
The LogisticRegression with access to the full 100-dimensions finds a discriminating-frontier for assigning the class that's described by 100 coefficients and one intercept-value – and all of those 101 values (free parameters) can be adjusted to improve its classification performance.
In comparison, your alternative LogisticRegression with access to only the one 'distance-from-a-random-point' dimension can pick just one coefficient (for the distance) and an intercept/bias. It's got 1/100th as much information to work with, and only 2 free parameters to adjust.
As an analogy, consider a much simpler space: the surface of the Earth. Pick a 'random' point, like say the South Pole. If I then tell you that you are in an unknown place 8900 miles from the South Pole, can you answer whether you are more likely in the USA or China? Hardly – both of those 'classes' of location have lots of instances 8900 miles from the South Pole.
Only in the extremes will the distance tell you for sure which class (country) you're in – because there are parts of the USA's Alaska and Hawaii further north and south than parts of China. But even there, you can't manage well with just a single threshold: you'd need a rule which says, "less than X or greater than Y, in USA; otherwise unknown".
The 100-dimensional space of Doc2Vec vectors (or other rich data sources) will often only be sensibly divided by far more complicated rules. And, our intuitions about distances and volumes based on 2- or 3-dimensional spaces will often lead us astray, in high dimensions.
Still, the Earth analogy does suggest a way forward: there are some reference points on the globe that will work way better, when you know the distance to them, at deciding if you're in the USA or China. In particular, a point at the center of the US, or at the center of China, would work really well.
Similarly, you may get somewhat better classification accuracy if rather than a random fix_vec, you pick either (a) any point for which a class is already known; or (b) some average of all known points of one class. In either case, your fix_vec is then likely to be "in a neighborhood" of similar examples, rather than some random spot (that has no more essential relationship to your classes than the South Pole has to northern-Hemisphere temperate-zone countries).
(Also: alternatively picking N multiple random points, and then feeding the N distances to your regression, will preserve more of the information/shape of the original Doc2Vec data, and thus give the classifier a better chance of finding a useful separating-threshold. Two would likely do better than your one distance, and 100 might approach or surpass the 100 original dimensions.)
Finally, some comment about the Doc2Vec aspect:
Doc2Vec optimizes vectors that are somewhat-good, within their constrained model, at predicting the words of a text. Positive-sentiment words tend to occur together, as do negative-sentiment words, and so the trained doc-vectors tend to arrange themselves in similar positions when they need to predict similar-meaning-words. So there are likely to be 'neighborhoods' of the doc-vector space that correlate well with predominantly positive-sentiment or negative-sentiment words, and thus positive or negative sentiments.
These won't necessarily be two giant neighborhoods, 'positive' and 'negative', separated by a simple boundary –or even a small number of neighborhoods matching our ideas of 3-D solid volumes. And many subtleties of communication – such as sarcasm, referencing a not-held opinion to critique it, spending more time on negative aspects but ultimately concluding positive, etc – mean incursions of alternate-sentiment words into texts. A fully-language-comprehending human agent could understand these to conclude the 'true' sentiment, while these word-occurrence based methods will still be confused.
But with an adequate model, and the right number of free parameters, a classifier might capture some generalizable insight about the high-dimensional space. In that case, you can achieve reasonably-good predictions, using the Doc2Vec dimensions – as you've seen with the ~80%+ results on the full 100-dimensional vectors.

K-Medoids Cluster Analysis

What are some analysis functions which can be used on the K-Medoids algorithms?
My main aim is to compare results of 2 different clustering results in order to see which is better.
Can SSE (sum of squared errors) be applied to K-Medoids algorithm?
The original k-medoid publication discusses the measures ESS, along with several other measures such as average dissimilarity, maximum dissimilarity, diameter that may be more appropriate to use.
SSE is closely related to Euclidean distance, so it usually is not appropriate (unless, of course, you use Euclidean; but why would you use k-medoids then instead of k-means?)
ARI, NMI, and Silhouette Coefficient can be used to compare the results

Iterative modularity

I have run modulartiy edge_weight/randomized at a resolution of 1, atleast 20 times on the same network. This is the same network I have created based on the following rule. Two nodes are related if they have atleast one item in common. Every time I run modularity I get a little different node distribution among communities. Additionally, I get 9 or 10 communities but it is never consistent. Any comment or help is much appreciated.
I found a solution to my problem using consensus clustering. Here is the paper that describes it. One way to get the optimum clusters without having to solve them in a high-dimensional space using spectral clustering would be to run the algorithm repeatedly until no more partitions can be achieved. Here is the article and complete explanation details:
SCIENTIFIC REPORTS | 2 : 336 | DOI: 10.1038/srep00336
Consensus clustering in complex networks Andrea Lancichinetti & Santo Fortunato
The consensus matrix. Let us suppose that we wish to combine nP partitions found by a clustering algorithm on a network with n vertices. The consensus matrix D is an n x n matrix, whose entry Dij indicates the number of partitions in which vertices i and j of the network were assigned to the same cluster, divided by the number of partitions nP. The matrix D is usually much denser than the adjacency matrix A of the original network, because in the consensus matrix there is an edge between any two vertices which have cooccurred in the same cluster at least once. On the other hand, the weights are large only for those vertices which are most frequently coclustered, whereas low weights indicate that the vertices are probably at the boundary between different (real) clusters, so their classification in the same cluster is unlikely and essentially due to noise. We wish to maintain the large weights and to drop the low ones, therefore a filtering procedure is in order. Among the other things, in the absence of filtering the consensus matrix would quickly grow into a very dense matrix, which would make the application of any clustering algorithm computationally expensive.We discard all entries of D below a threshold t. We stress that there might be some noisy vertices whose edges could all be below the threshold, and they would be not connected anymore. When this happens, we just connect them to their neighbors with highest weights, to keep the graph connected all along the procedure.
Next we apply the same clustering algorithm to D and produce another set of partitions, which is then used to construct a new consensus matrix D9, as described above. The procedure is iterated until the consensus matrix turns into a block diagonal matrix Dfinal, whose weights equal 1 for vertices in the same block and 0 for vertices in different blocks. The matrix Dfinal delivers the community structure of the original network. In our calculations typically one iteration is sufficient to lead to stable results. We remark that in order to use the same clustering method all along, the latter has to be able to detect clusters in weighted networks, since the consensus matrix is weighted. This is a necessary constraint on the choice of the methods for which one could use the procedure proposed here. However, it is not a severe limitation,as most clustering algorithms in the literature can handle weighted networks or can be trivially extended to deal with them.
I think that the answer is in the randomizing part of the algorithm. You can find more details here:
https://github.com/gephi/gephi/wiki/Modularity
https://sites.google.com/site/findcommunities/
http://lanl.arxiv.org/abs/0803.0476

DBSCAN using spatial and temporal data

I am looking at data points that have lat, lng, and date/time of event. One of the algorithms I came across when looking at clustering algorithms was DBSCAN. While it works ok at clustering lat and lng, my concern is it will fall apart when incorporating temporal information, since it's not of the same scale or same type of distance.
What are my options for incorporating temporal data into the DBSCAN algorithm?
Look up Generalized DBSCAN by the same authors.
Sander, Jörg; Ester, Martin; Kriegel, Hans-Peter; Xu, Xiaowei (1998). Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Mining and Knowledge Discovery (Berlin: Springer-Verlag) 2(2): 169–194. doi:10.1023/A:1009745219419.
For (Generalized) DBSCAN, you need two functions:
findNeighbors - get all "related" objects from your database
corePoint - decide whether this set is enough to start a cluster
then you can repeatedly find neighbors to grow the clusters.
Function 1 is where you want to hook into, for example by using two thresholds: one that is geographic and one that is temporal (i.e. within 100 miles, and within 1 hour).
tl;dr you are going to have to modify your feature set, i.e. scaling your date/time to match the magnitude of your geo data.
DBSCAN's input is simply a vector, and the algorithm itself doesn't know that one dimension (time) is orders of magnitudes bigger or smaller than another (distance). Thus, when calculating the density of data points, the difference in scaling will screw it up.
Now I suppose you can modify the algorithm itself to treat different dimensions differently. This can be done by changing the definition of "distance" between two points, i.e. supplying your own distance function, instead of using the default Euclidean distance.
IMHO, though, the easier thing to do is to scale one of your dimension to match another. just multiply your time values by a fixed, linear factor so they are on the same order of magnitude as the geo values, and you should be good to go.
more generally, this is part of the features selection process, which is arguably the most important part of solving any machine learning algorithm. choose the right features, and transform them correctly, and you'd be more than halfway to a solution.

Resources