More accurate approach than k-mean clustering - machine-learning

In Radial Basis Function Network (RBF Network), all the prototypes (center vectors of the RBF functions) in the hidden layer are chosen. This step can be performed in several ways:
Centers can be randomly sampled from some set of examples.
Or, they can be determined using k-mean clustering.
One of the approaches for making an intelligent selection of prototypes is to perform k-mean clustering on our training set and to use the cluster centers as the prototypes.
All we know that k-mean clustering is caracterized by its simplicity (it is fast) but not very accurate.
That is why I would like know what is the other approach that can be more accurate than k-mean clustering?
Any help will be very appreciated.

Several k-means variations exist: k-medians, Partitioning Around Medoids, Fuzzy C-Means Clustering, Gaussian mixture models trained with expectation-maximization algorithm, k-means++, etc.
I use PAM (Partitioning around Medoid) in order to be more accurate when my dataset contain some "outliers" (noise with value which are very different to the others values) and I don't want the centers to be influenced by this data. In the case of PAM a center is called a Medoid.

There is a more statistical approach to cluster analysis, called the Expectation-Maximization Algorithm. It uses statistical analysis to determine clusters. This is probably a better approach when you have a lot of data regarding your cluster centroids and training data.
This link also lists several other clustering algorithms out there in the wild. Obviously, some are better than others, depending on the amount of data you have and/or the type of data you have.
There is a wonderful course on Udacity, Intro to Artificial Intelligence, where one lesson is dedicated to unsupervised learning, and Professor Thrun explains some clustering algorithms in very great detail. I highly recommend that course!
I hope this helps,

In terms of K-Means, you can run it on your sample a number of times (say, 100) and then choose the clustering (and by consequence the centroids) that has the smallest K-Means criterion output (the sum of the square Euclidean distances between each entity and its respective centroid).
You can also use some initialization algorithms (the intelligent K-Means comes to mind, but you can also google for K-Means++). You can find a very good review of K-Means in a paper by AK Jain called Data clustering: 50 years beyond K-means.
You can also check hierarchical methods, such as the Ward method.

Related

Clustering Method Selection in High-Dimension?

If the data to cluster are literally points (either 2D (x, y) or 3D (x, y,z)), it would be quite intuitive to choose a clustering method. Because we can draw them and visualize them, we somewhat know better which clustering method is more suitable.
e.g.1 If my 2D data set is of the formation shown in the right top corner, I would know that K-means may not be a wise choice here, whereas DBSCAN seems like a better idea.
However, just as the scikit-learn website states:
While these examples give some intuition about the algorithms, this
intuition might not apply to very high dimensional data.
AFAIK, in most of the piratical problems we don't have such simple data. Most probably, we have high-dimensional tuples, which cannot be visualized like such, as data.
e.g.2 I wish to cluster a data set where each data is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT visualize it in a coordinate system and observes its distribution like before. So I will NOT be able to say DBSCAN is superior to K-means in this case.
So my question:
How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?
"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).
4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.
Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.
The authors of that survey also publish a software framework which has a lot of these advanced clustering methods (not just k-means, but e.h. CASH, FourC, ERiC): ELKI
There are at least two common, generic approaches:
One can use some dimensionality reduction technique in order to actually visualize the high dimensional data, there are dozens of popular solutions including (but not limited to):
PCA - principal component analysis
SOM - self-organizing maps
Sammon's mapping
Autoencoder Neural Networks
KPCA - kernel principal component analysis
Isomap
After this one goes back to the original space and use some techniques that seems resonable based on observations in the reduced space, or performs clustering in the reduced space itself.First approach uses all avaliable information, but can be invalid due to differences induced by the reduction process. While the second one ensures that your observations and choice is valid (as you reduce your problem to the nice, 2d/3d one) but it loses lots of information due to transformation used.
One tries many different algorithms and choose the one with the best metrics (there have been many clustering evaluation metrics proposed). This is computationally expensive approach, but has a lower bias (as reducting the dimensionality introduces the information change following from the used transformation)
It is true that high dimensional data cannot be easily visualized in an euclidean high dimensional data but it is not true that there are no visualization techniques for them.
In addition to this claim I will add that with just 4 features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try a multivariate data analysis taking two features at a time (so 6 times in total) to try to figure out which relations intercour between the two (correlation and dependency generally). Or you can even use a 3d space for three at a time.
Then, how to get some info from these visualizations? Well, it is not as easy as in an euclidean space but the point is to spot visually if the data clusters in some groups (eg near some values on an axis for a parallel coordinate diagram) and think if the data is somehow separable (eg if it forms regions like circles or line separable in the scatter plots).
A little digression: the diagram you posted is not indicative of the power or capabilities of each algorithm given some particular data distributions, it simply highlights the nature of some algorithms: for instance k-means is able to separate only convex and ellipsoidail areas (and keep in mind that convexity and ellipsoids exist even in N-th dimensions). What I mean is that there is not a rule that says: given the distributiuons depicted in this diagram, you have to choose the correct clustering algorithm consequently.
I suggest to use a data mining toolbox that lets you explore and visualize the data (and easily transform them since you can change their topology with transformations, projections and reductions, check the other answer by lejlot for that) like Weka (plus you do not have to implement all the algorithms by yourself.
In the end I will point you to this resource for different cluster goodness and fitness measures so you can compare the results rfom different algorithms.
I would also suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU calculation with euclidean distance, for example.

How To Fight Randomness Caused By KMeans Clustering

I'm developing an algorithm to classify different types of dogs based off of image data. The steps of the algorithm are:
Go through all training images, detect image features (ie SURF), and extract descriptors. Collect all descriptors for all images.
Cluster within the collected image descriptors and find k "words" or centroids within the collection.
Reiterate through all images, extract SURF descriptors, and match the extracted descriptor with the closest "word" found via clustering.
Represent each image as a histogram of the words found in clustering.
Feed these image representations (feature vectors) to a classifier and train...
Now, I have run into a bit of a problem. Finding the "words" within the collection of image descriptors is a very important step. Due to the random nature of clustering, different clusters are found each time I run my program. The unfortunate result is that sometimes the accuracy of my classifier will be very good, and other times, very bad. I have chalked this up to the clustering algorithm finding "good" words sometimes, and "bad" words other times.
Does anyone know how I can hedge against the clustering algorithm from finding "bad" words? Currently I just cluster several times and take the mean accuracy of my classifier, but there must be a better way.
Thanks for taking time to read through this, and thank you for your help!
EDIT:
I am not using KMeans for classification; I am using a Support Vector Machine for classification. I am using KMeans for finding image descriptor "words", and then using these words to create histograms which describe each image. These histograms serve as feature vectors that are fed to the Support Vector Machine for classification.
There are many possible ways of making clustering repeatable:
The most basic method of dealing with k-means randomness is simply running it multiple times and selecting the best one (the one that minimizes the inner cluster distances/maximizes the between clusters distance).
One can use some fixed initialization for your data instead of randomization. There are many heuristics for starting the k-means. Or at least minimize the variance by using algorithms like k-means++.
Use modification of k-means which guarantees global minimum of regularized function, ie. convex k-means
Use different clustering method, which is deterministic, ie. Data Nets
I would offer two possible suggestions, in addition to those provided.
K-means optimises an objective related to the distance between cluster points and their centroids. You care about classification accuracy. Depending on the computational cost, a simple brute-force approach is to induce multiple clusterings on a subset of your training data, and evaluate the performance of each on some held-out development set for the task you care about. Then use the highest performing variant as the final model. I don't like the use of non-random initialisation because this is only a solution to avoid the randomness, not find the true global minimum of the objective, and your chosen initialisation may be useless and just produce consistently bad classifiers.
The other approach, which is much harder, is to view the k-means step as a dimensionality reduction to enable classification, and incorporate this into the classifier directly. If you use a deep neural net, the layer(s) closest to the input are essentially dimensionality reducers in the same way as the k-means clustering you induce: the difference is their weights are set wrt the error of the net on the classification problem, rather than some unrelated intermediate step. The downside is that this is much closer to a current research problem: training deep nets is hard. You could start with a standard one-hidden-layer architecture (with binary activations on the hidden layer, and using cross-entropy loss on the output layer with outputs coded as one-of-n categories), and attempt to add layers incrementally, but as far as I'm aware standard training algorithms start to behave poorly beyond the single hidden layer, so you'd need to investigate layer-wise training to initialise, or some of the Hessian-Free stuff coming out of Geoff Hinton's group in Toronto.
That is actually an important problem with the BofW approach, and you should share this prominently. SIFT data may actually not have k-means clusters at all. However, due to the nature of the algorithm, k-means will always produce k clusters. One of the things to test with k-means is to validate that the results are stable. If you get a completely different result each time, they are not much better than random.
Nevertheless, if you just want to get some working results, you can just fix the dictionary once and choose one that is working well.
Or you might look into more advanced clustering (in particular one that is more robust wrt. noise!)

ML / density clustering on house areas. two-component or more mixtures in each dimension

I trying to self-learn ML and came across this problem. Help from more experienced people in the field would be much appreciated!
Suppose i have three vectors with areas for house compartments such as bathroom, living room and kitchen. The data consists of about 70,000 houses. A histogram of each individual vector clearly has evidence for a bimodal distribution, say a two-component gaussian mixture. I now wanted some sort of ML algorithm, preferably unsupervised, that would classify houses according to these attributes. Say: large bathroom, small kitchen, large living-room.
More specifically, i would like an algorithm to choose the best possible separation threshold for each bimodal distribution vector, say large/small kitchen (this can be binary as there we assume evidence for a bimodality), do the same for others and cluster the data. Ideally this would come with some confidence measure so that i could check houses in the intermediate regimes... for instance, a house with clearly a large kitchen, but whose bathroom would fall close to a threshold area/ boundary for large/small bathroom would be put for example on the bottom of a list with "large kitchens and large bathrooms". Because of this reason, first deciding on a threshold (fitting the gausssians with less possible FDR), collapsing the data and then clustering would Not be desirable.
Any advice on how to proceed? I know R and python.
Many thanks!!
What you're looking for is a clustering method: this is basically unsupervised classification. A simple method is k-means, which has many implementations (k-means can be viewed as the limit of a multi-variate Gaussian mixture as the variance tends to zero). This would naturally give you a confidence measure, which would be related to the distance metric (Euclidean distance) between the point in question and the centroids.
One final note: I don't know about clustering each attribute in turn, and then making composites from the independent attributes: why not let the algorithm find the clusters in multi-dimensional space? Depending on the choice of algorithm, this will take into account covariance in the features (big kitchen increases the probability of big bedroom) and produce natural groupings you might not consider in isolation.
Sounds like you want EM clustering with a mixture of Gaussians model.
Should be in the mclust package in R.
In addition to what the others have suggested, it is indeed possible to cluster (maybe even density-based clustering methods such as DBSCAN) on the individual dimensions, forming one-dimensional clusters (intervals) and working from there, possibly combining them into multi-dimensional, rectangular-shaped clusters.
I am doing a project involving exactly this. It turns out there are a few advantages to running density-based methods in one dimension, including the fact that you can do what you are saying about classifying objects on the border of one attribute according to their other attributes.

Performance Analysis of Clustering Algorithms

I have been given 2 data sets and want to perform cluster analysis for the sets using KNIME.
Once I have completed the clustering, I wish to carry out a performance comparison of 2 different clustering algorithms.
With regard to performance analysis of clustering algorithms, would this be a measure of time (algorithm time complexity and the time taken to perform the clustering of the data etc) or the validity of the output of the clusters? (or both)
Is there any other angle one look at to identify the performance (or lack of) for a clustering algorithm?
Many thanks in advance,
T
It depends a lot on what data you have available.
A common way of measuring the performance is with respect to existing ("external") labels (albeit that would make more sense for classification than for clustering). There are around two dozen measures you can use for this.
When using an "internal" quality measure, make sure that it is independent of the algorithms. For example, k-means optimizes such a measure, and will always come out best when evaluating with respect to this measure.
There are two categories of clustering evaluation methods and the choice depends
on whether a ground truth is available. The first category is the extrinsic methods which require the existence of a ground truth and the other category is the intrinsic methods. In general, extrinsic methods try to assign a score to a clustering, given the ground truth, whereas intrinsic methods evaluate clustering by examining how well the clusters are separated and how compact they are.
For extrinsic methods (remember you need to have a ground available) one option is to use the BCubed precision and recall metrics. The BCubed precion and recall metrics differ from the traditional precision and recall in the sense that clustering is an unsupervised learning technique and therefore we do not know the labels of the clusters beforehand. For this reason BCubed metrics evaluate the precion and recall for evry object in a clustering on a given dataset according to the ground truth. The precision of an example is an indication of how many other examples in the same cluster belong to the same category as the example. The recall of an example reflects how many examples of the same category are assigned to the same cluster. Finally, we can combine these two metrics in one using the F2 metric.
Sources:
Data Mining Concepts and Techniques by Jiawei Han, Micheline, Kamber and Jian Pei
http://www.cs.utsa.edu/~qitian/seminar/Spring11/03_11_11/IR2009.pdf
My own experience in evaluating the performance of clustering
A simple approach for the extrinsic methods where there is a ground truth available is to use a distance metric between clusterings; the ground truth is simply considered to be a clustering. Two good measures to use are the Variation of Information by Meila and, in my humble opinion, the split join distance by myself also discussed by Meila. I do not recommend the Mirkin index or the Rand index - I've written more about it here on stackexchange.
These metrics can be split into two constituent parts, each representing the distance of one of the clusterings to the largest common subclustering. It is worthwhile to consider both parts; if the ground truth part (to common subclustering) is very small, it means that the tested clustering is close to a superclustering; if the other part is small it means that the tested clustering is close to the common subclustering and hence close to a subclustering of the ground truth. In both cases the clustering can be said to be compatible with the ground truth. For more information see the link above.
There are several benchmarks for the clustering algorithms evaluation with extrinsic quality measures (accuracy) and intrinsic measures (some internal statistics of the formed clusters):
Clubmark demonstrated in ICDM'18
WebOCD, see description in the paper
Circulo
ParallelComMetric
CluSim
CoDAR (the sources might be acquired from the paper authors)
Selection of the appropriate benchmark depends on the kind of the clustering algorithm (hard or soft clustering), kind (pairwise relations, attributed datasets or mixed) and size of the clustering data, required evaluation metrics and the admissible amount of the supervision. The Clubmark paper describes evaluation criteria in details.
The Clubmark is developed for the fully automatic parallel evaluation of many clustering algorithms (processing input data specified by the pairwise relations) on many large datasets (millions and billions of clustering elements) and evaluated mostly by accuracy metrics tracing resource consumption (processing and execution time, peak resident memory consumption, etc.).
But for a couple of algorithms on a couple of datasets even the manual evaluation is appropriate.

Unsupervised classification methods available

I'm doing a research which involves "unsupervised classification".
Basically I have a trainSet and I want to cluster data in X number of classes in unsupervised way. Idea is similar to what k-means does.
Let's say
Step1)
featureSet is a [1057x10] matrice and I want to cluster them into 88 clusters.
Step2)
Use previously calculated classes to compute how does the testData is classified
Question
-Is it possible to do it with SVM or N-N ? Anything else ?
-Any other recommendations ?
There are many clustering algorithms out there, and the web is awash with information on them and sample implementations. A good starting point is the Wikipedia entry on cluster analysis Cluster_analysis.
As you have a working k-means implementation, you could try one of the many variants to see if they yeild better results (k-means++ perhaps, seeing as you mentioned SVM). If you want a completely different approach, have a look at Kohonen Maps - also called Self Organising Feature Maps. If that looks too tricky, a simple hierarchical clustering would be easy to implement (find the nearest two items, combine, rinse and repeat).
This sounds like a classic clustering problem. Neither SVMs or neural networks are going to be able to directly solve this problem. You can use either approach for dimensionality reduction, for example to embed your 10-dimensional data in two-dimensional space, but they will not put the data into clusters for you.
There are a huge number of clustering algorithms besides k-means. If you wanted a contrasting approach, you might want to try an agglomerative clustering algorithm. I don't know what kind of computing environment you are using, but I quite like R and this (very) short guide on clustering.

Resources