Hierarchical Clustering - machine-learning

I have read some resources and I understand how hierarchical clustering works. However, when I compare it with k-means clustering, it seems to me that k-means actually produces a specific number of clusters, whereas hierarchical analysis only shows me how the samples can be clustered. What I mean is that I do not get a specific number of clusters from hierarchical clustering; I only get a scheme of how the clusters can be formed and a picture of the relations between the samples.
Thus, I cannot understand where I can use this clustering method.

Hierarchical clustering (HC) is just another distance-based clustering method, like k-means. The number of clusters can be determined roughly by cutting the dendrogram produced by HC. Determining the number of clusters in a data set is not an easy task for any clustering method and usually depends on your application. Tuning the cut threshold in HC may be more explicit and straightforward for researchers, especially for a very large data set. I think this question is also related.

In k-means clustering, k is a hyperparameter that you need to choose in order to divide your data points into clusters. In hierarchical clustering (let's take one type, agglomerative), you first treat every point in your dataset as its own cluster, then repeatedly merge the two most similar clusters according to a similarity metric until only a single cluster remains. I will explain this with an example.
Suppose you initially have 13 points (x_1, x_2, ..., x_13) in your dataset, so at the start you have 13 clusters. In the second step, say you get 7 clusters (x_1-x_2, x_4-x_5, x_6-x_8, x_3-x_7, x_11-x_12, x_10, x_13) based on the similarity between the points. In the third step, say you get 4 clusters (x_1-x_2-x_4-x_5, x_6-x_8-x_10, x_3-x_7-x_13, x_11-x_12). Continuing like this, you eventually reach a step where all the points in your dataset form one cluster, which is the last step of the agglomerative clustering algorithm.
So in hierarchical clustering there is no k to fix in advance: depending on your problem, if you want 7 clusters you stop at the second step, if you want 4 clusters you stop at the third step, and so on.
A practical advantage of hierarchical clustering is the possibility of visualizing the results with a dendrogram. If you don't know in advance what number of clusters you're looking for (as is often the case…), the dendrogram plot can help you choose k with no need to create separate clusterings. The dendrogram can also give great insight into the data structure, help identify outliers, etc. Hierarchical clustering is also deterministic, whereas k-means with random initialization can give you different results when run several times on the same data.
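To make the dendrogram-cutting idea concrete, here is a minimal sketch (my own illustration, not code from the answers above) using SciPy's agglomerative clustering: build the merge tree, plot the dendrogram, and cut it into a chosen number of flat clusters.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(13, 2))              # 13 toy points, as in the example above

Z = linkage(X, method="ward")             # agglomerative merge history

dendrogram(Z)                             # inspect the merge structure visually
plt.show()

labels = fcluster(Z, t=4, criterion="maxclust")   # cut the tree into 4 flat clusters
print(labels)
```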
Hope this helps.

Related

dimensional time series data clustering

I have a time-series data set with three dimensions, namely acceleration, speed and grade. I want to apply clustering to identify clusters that have similar speed (acceleration = 0, positive or negative) varying with grade. I do not know what type of clustering I should use; k-means surely cannot help me because there is serial correlation in my data, since each point is affected by its previous point. Could you please help me with the type of clustering?
Popular time series similarity metrics such as DTW can be implemented for multiple variates the same way as for a single variate. The most challenging part is normalization.
You then can run hierarchical clustering trivially. Do not use KMeans.
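As a rough illustration of that suggestion (a sketch assuming z-normalized series and a plain dynamic-programming DTW, not a tuned library implementation): compute pairwise multivariate DTW distances, then run hierarchical clustering on the distance matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def znorm(s):
    """z-normalize each variate of a (length, n_dims) series."""
    return (s - s.mean(axis=0)) / (s.std(axis=0) + 1e-9)

def dtw(a, b):
    """Multivariate DTW distance via a plain O(n*m) dynamic program."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # distance between multivariate points
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy stand-ins for the real series: each is (length, 3) for acceleration, speed, grade.
rng = np.random.default_rng(0)
series = [znorm(rng.normal(size=(50, 3))) for _ in range(8)]

# Pairwise DTW distances -> condensed form -> hierarchical clustering.
k = len(series)
dists = np.zeros((k, k))
for i in range(k):
    for j in range(i + 1, k):
        dists[i, j] = dists[j, i] = dtw(series[i], series[j])

Z = linkage(squareform(dists), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")
```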

A more accurate approach than k-means clustering

In a Radial Basis Function network (RBF network), all the prototypes (the center vectors of the RBF functions) in the hidden layer must be chosen. This step can be performed in several ways:
Centers can be randomly sampled from some set of examples.
Or, they can be determined using k-means clustering.
One of the approaches for making an intelligent selection of prototypes is to perform k-means clustering on our training set and to use the cluster centers as the prototypes.
We all know that k-means clustering is characterized by its simplicity (it is fast) but is not very accurate.
That is why I would like to know: what other approach could be more accurate than k-means clustering?
Any help will be much appreciated.
Several k-means variations exist: k-medians, Partitioning Around Medoids, Fuzzy C-Means Clustering, Gaussian mixture models trained with the expectation-maximization algorithm, k-means++, etc.
I use PAM (Partitioning Around Medoids) to be more accurate when my dataset contains some outliers (noise whose values are very different from the other values) and I don't want the centers to be influenced by this data. In PAM, a center is called a medoid.
There is a more statistical approach to cluster analysis, called the expectation-maximization (EM) algorithm. It fits a probabilistic model to determine the clusters. This is probably a better approach when you have a lot of training data available for estimating your cluster centroids.
This link also lists several other clustering algorithms out there in the wild. Obviously, some are better than others, depending on the amount and/or type of data you have.
There is a wonderful course on Udacity, Intro to Artificial Intelligence, where one lesson is dedicated to unsupervised learning, and Professor Thrun explains some clustering algorithms in very great detail. I highly recommend that course!
I hope this helps.
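As a hedged sketch of the EM-based suggestion above, here is how a Gaussian mixture model could be fit with scikit-learn; the fitted component means could then serve as RBF prototypes in place of k-means centroids (the data here is synthetic, purely for illustration).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data with three blobs, purely for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [3, 3], [0, 3])])

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)

prototypes = gmm.means_        # candidate RBF centers, analogous to k-means centroids
labels = gmm.predict(X)        # hard assignments; predict_proba gives soft memberships
```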
In terms of K-Means, you can run it on your sample a number of times (say, 100) and then choose the clustering (and, by consequence, the centroids) that has the smallest K-Means criterion value (the sum of squared Euclidean distances between each entity and its respective centroid).
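A small sketch of that restart strategy (assuming scikit-learn, where `inertia_` is the within-cluster sum of squared distances):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))          # stand-in for your data

best = None
for seed in range(100):                # 100 restarts with different seeds
    km = KMeans(n_clusters=5, n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:
        best = km                      # keep the run with the smallest criterion

centroids = best.cluster_centers_
```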
You can also use some initialization algorithms (the intelligent K-Means comes to mind, but you can also google for K-Means++). You can find a very good review of K-Means in a paper by AK Jain called Data clustering: 50 years beyond K-means.
You can also check hierarchical methods, such as the Ward method.

Clustering Method Selection in High-Dimension?

If the data to cluster are literally points (either 2D (x, y) or 3D (x, y, z)), choosing a clustering method is fairly intuitive. Because we can draw and visualize the points, we have a better sense of which clustering method is more suitable.
e.g.1 If my 2D data set is of the formation shown in the right top corner, I would know that K-means may not be a wise choice here, whereas DBSCAN seems like a better idea.
However, just as the scikit-learn website states:
While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional data.
AFAIK, in most practical problems we don't have such simple data. Most probably, we have high-dimensional tuples that cannot be visualized in this way.
e.g.2 I wish to cluster a data set where each data point is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT visualize it in a coordinate system and observe its distribution as before. So I will NOT be able to say DBSCAN is superior to K-means in this case.
So my question:
How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?
"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).
4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
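For illustration, a scatter plot matrix (and a parallel coordinates plot) of the 4-dimensional iris data can be produced with pandas' plotting helpers; this is just one way to get such a view, not necessarily how the figure referenced above was made.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data                                    # the 4 numeric feature columns

pd.plotting.scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()

# Parallel coordinates, colored here by the known species label:
df["species"] = iris.target_names[iris.target]
pd.plotting.parallel_coordinates(df, "species")
plt.show()
```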
However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.
Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.
The authors of that survey also publish a software framework which has a lot of these advanced clustering methods (not just k-means, but e.g. CASH, FourC, ERiC): ELKI
There are at least two common, generic approaches:
One can use some dimensionality reduction technique in order to actually visualize the high-dimensional data; there are dozens of popular solutions, including (but not limited to):
PCA - principal component analysis
SOM - self-organizing maps
Sammon's mapping
Autoencoder Neural Networks
KPCA - kernel principal component analysis
Isomap
After this, one either goes back to the original space and uses whichever techniques seem reasonable based on observations in the reduced space, or performs clustering in the reduced space itself. The first approach uses all available information but can be invalid due to distortions induced by the reduction process, while the second ensures that your observations and choice are valid (as you reduce your problem to a nice 2D/3D one) but loses a lot of information due to the transformation used.
One tries many different algorithms and chooses the one with the best metrics (many clustering evaluation metrics have been proposed). This is a computationally expensive approach, but it has a lower bias (as reducing the dimensionality introduces changes to the information that follow from the transformation used).
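A minimal sketch of the first approach (assuming PCA as the reducer and k-means as the clusterer, which are only one possible pairing): project to 2D for inspection, then cluster either in the reduced space or in the original space.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = load_iris().data                              # 4-dimensional example data

X2 = PCA(n_components=2).fit_transform(X)         # reduced space, for visual inspection
plt.scatter(X2[:, 0], X2[:, 1])
plt.show()

# Option A: cluster in the reduced space (cheap, but some information is lost).
labels_reduced = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)

# Option B: cluster in the original space, using the plot only to guide the choice of k.
labels_original = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```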
It is true that high-dimensional data cannot be easily visualized in a Euclidean coordinate system, but it is not true that there are no visualization techniques for them.
In addition, with just 4 features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try a multivariate data analysis taking two features at a time (so 6 pairs in total) to figure out which relations hold between them (correlation and dependency, generally). Or you can even use a 3D space for three features at a time.
Then, how do you get information from these visualizations? Well, it is not as easy as in a Euclidean space, but the point is to spot visually whether the data clusters into groups (e.g. near some values on an axis in a parallel coordinates diagram) and to consider whether the data is somehow separable (e.g. whether it forms circle-like or linearly separable regions in the scatter plots).
A little digression: the diagram you posted is not indicative of the power or capabilities of each algorithm given particular data distributions; it simply highlights the nature of some algorithms. For instance, k-means is able to separate only convex and ellipsoidal areas (and keep in mind that convexity and ellipsoids exist in N dimensions as well). What I mean is that there is no rule that says: given the distributions depicted in this diagram, you must choose the corresponding clustering algorithm.
I suggest using a data mining toolbox that lets you explore and visualize the data (and easily transform them, since you can change their topology with transformations, projections and reductions; check the other answer by lejlot for that), such as Weka (plus you do not have to implement all the algorithms yourself).
In the end, I will point you to this resource for different cluster goodness and fitness measures, so you can compare the results from different algorithms.
I would also suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU calculation with Euclidean distance, for example.

How To Fight Randomness Caused By KMeans Clustering

I'm developing an algorithm to classify different types of dogs based off of image data. The steps of the algorithm are:
Go through all training images, detect image features (ie SURF), and extract descriptors. Collect all descriptors for all images.
Cluster within the collected image descriptors and find k "words" or centroids within the collection.
Reiterate through all images, extract SURF descriptors, and match the extracted descriptor with the closest "word" found via clustering.
Represent each image as a histogram of the words found in clustering.
Feed these image representations (feature vectors) to a classifier and train...
Now, I have run into a bit of a problem. Finding the "words" within the collection of image descriptors is a very important step. Due to the random nature of clustering, different clusters are found each time I run my program. The unfortunate result is that sometimes the accuracy of my classifier will be very good, and other times, very bad. I have chalked this up to the clustering algorithm finding "good" words sometimes, and "bad" words other times.
Does anyone know how I can prevent the clustering algorithm from finding "bad" words? Currently I just cluster several times and take the mean accuracy of my classifier, but there must be a better way.
Thanks for taking time to read through this, and thank you for your help!
EDIT:
I am not using KMeans for classification; I am using a Support Vector Machine for classification. I am using KMeans for finding image descriptor "words", and then using these words to create histograms which describe each image. These histograms serve as feature vectors that are fed to the Support Vector Machine for classification.
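For reference, here is a hedged sketch of the dictionary-and-histogram step described above, with the descriptor extraction assumed to have happened already (`descriptors_per_image` is a hypothetical list of per-image descriptor arrays); fixing the random seed and using several restarts is one simple way to keep the dictionary stable between runs.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors_per_image, n_words=500, seed=0):
    """Cluster all descriptors into n_words visual words."""
    all_desc = np.vstack(descriptors_per_image)
    # A fixed random_state plus several restarts keeps the dictionary reproducible.
    return KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(all_desc)

def image_histogram(descriptors, vocab):
    """Represent one image as a normalized histogram over the visual words."""
    words = vocab.predict(descriptors)                   # nearest word per descriptor
    hist = np.bincount(words, minlength=vocab.n_clusters)
    return hist / max(hist.sum(), 1)

# vocab = build_vocabulary(descriptors_per_image)
# features = np.array([image_histogram(d, vocab) for d in descriptors_per_image])
# ...then feed `features` to the SVM as before.
```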
There are many possible ways of making clustering repeatable:
The most basic method of dealing with k-means randomness is simply running it multiple times and selecting the best result (the one that minimizes the within-cluster distances / maximizes the between-cluster distances).
One can use a fixed initialization for your data instead of randomization. There are many heuristics for starting k-means. Or at least minimize the variance by using an algorithm like k-means++.
Use a modification of k-means that guarantees a global minimum of a regularized objective, e.g. convex k-means.
Use a different clustering method that is deterministic, e.g. Data Nets.
I would offer two possible suggestions, in addition to those provided.
K-means optimises an objective related to the distance between cluster points and their centroids. You care about classification accuracy. Depending on the computational cost, a simple brute-force approach is to induce multiple clusterings on a subset of your training data, and evaluate the performance of each on some held-out development set for the task you care about. Then use the highest performing variant as the final model. I don't like the use of non-random initialisation because this is only a solution to avoid the randomness, not find the true global minimum of the objective, and your chosen initialisation may be useless and just produce consistently bad classifiers.
The other approach, which is much harder, is to view the k-means step as a dimensionality reduction to enable classification, and incorporate this into the classifier directly. If you use a deep neural net, the layer(s) closest to the input are essentially dimensionality reducers in the same way as the k-means clustering you induce: the difference is their weights are set wrt the error of the net on the classification problem, rather than some unrelated intermediate step. The downside is that this is much closer to a current research problem: training deep nets is hard. You could start with a standard one-hidden-layer architecture (with binary activations on the hidden layer, and using cross-entropy loss on the output layer with outputs coded as one-of-n categories), and attempt to add layers incrementally, but as far as I'm aware standard training algorithms start to behave poorly beyond the single hidden layer, so you'd need to investigate layer-wise training to initialise, or some of the Hessian-Free stuff coming out of Geoff Hinton's group in Toronto.
That is actually an important problem with the bag-of-words (BoW) approach, and you should point it out prominently. SIFT data may actually not have k-means clusters at all. However, due to the nature of the algorithm, k-means will always produce k clusters. One of the things to test with k-means is whether the results are stable. If you get a completely different result each time, they are not much better than random.
Nevertheless, if you just want to get some working results, you can simply fix the dictionary once and choose one that works well.
Or you might look into more advanced clustering (in particular, one that is more robust w.r.t. noise!).

Clustering Baseline Comparison, KMeans

I'm working on an algorithm that makes a guess at K for k-means clustering. I'm looking for a data set that I could use as a comparison, or maybe a few data sets where the number of clusters is "known", so I could see how my algorithm is doing at guessing K.
I would first check the UCI repository for data sets:
http://archive.ics.uci.edu/ml/datasets.html?format=&task=clu&att=&area=&numAtt=&numIns=&type=&sort=nameUp&view=table
I believe there are some in there with the labels.
There are text clustering data sets that are frequently used in papers as baselines, such as 20newsgroups:
http://qwone.com/~jason/20Newsgroups/
Another great method (one that my thesis chair always advocated) is to construct your own small example data set. The best way to go about this is to start small, try something with only two or three variables that you can represent graphically, and then label the clusters yourself.
The added benefit of a small, homebrew data set is that you know the answers and it is great for debugging.
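One hedged way to build such a homebrew set is with scikit-learn's synthetic generators, where the true number of clusters is chosen by you:

```python
from sklearn.datasets import make_blobs

# X: points to feed your K-guessing algorithm; y: the known cluster labels (ground truth).
X, y = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)
```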
Since you are focused on k-means, have you considered using the various measures (Silhouette, Davies–Bouldin etc.) to find the optimal k?
In reality, the "optimal" k may not be a good choice. Most often, one does want to choose a much larger k, then analyze the resulting clusters / prototypes in more detail to build clusters out of multiple k-means partitions.
The iris flower dataset, on which clustering works nicely, is a good one to start with.
Download here
