K-Means alternatives and performance - opencv

I've been reading about similarity measures and image feature extraction; most of the papers refer to k-means as a good uniform clustering technique and my question is, is there any alternative to k-means clustering that performs better for an specific set?

You may want to look at MeanShift clustering which has several advantages over K-Means:
Doesn't require a preset number of clusters
K-Means clusters converge to n-dimensional voronoi grid, MeanShift allows other cluster shapes
MeanShift is implemented in OpenCV in the form of CAMShift which is a MeanShift adaptation for tracking objects in a video sequence.
If you need more info, you can read this excellent paper about MeanShift and Computer Vision:
Mean shift: A robust approach toward feature space analysis

A simple first step, you could generalize k-means to EM. But there are tons of clustering methods available and the kind of clustering you need depends on your data (features) and the applications. In some cases, even your distances you use matters and so might have to do some sort of distance transformation, if it is not in the kind of space you want it to be in.

Related

Is there any subspace clustering method for dealing with numerical attributes?

I am trying to apply some clustering method on my datasets (with numerical dimensions). But I'm convinced that the features have different weights for different clusters. I read that there is an approach called soft subspace clustering that tries do identify the clusters and the weights of the features for each cluster simultaneously. However, the algorithms that I have found apre applied only to categorical data.
I am trying to identify some algorithm of soft subspace clustering for numerical. Do you know if there is any, or how can I adapt methods originally designed to deal with categorical data for dealing with numerical data (I think that it would necessary to propose some way of measuring the relevance of each numerical feature in each cluster)?
Yes, there are dozens of subspace clustering algorithms.
You'll need to do a proper literature research, this is too broad to cover in a QA like stack overflow. Look for (surprise) "subspace clustering", but also include "biclustering", for example.

Graph for high dimensional data in Mahout

I am interested in running the spectral clustering algorithm in Mahout on high dimensional data. My question is how does one take a list of high dimensional data vectors and create a nearest neighbor graph? Is this done in Mahout or are there map-reducable ways of doing such a thing.
There's nothing like that in the project, not for making a k-NNG. Spectral clustering, yes. Yes I'm sure you can implement this in MapReduce. The question is just how to do better than brute-force computing k-nearest neighbors.

Non mahout java - implementation of Canopy clustering

I have my own java based implementation of clustering (knn). However I am facing scalability issues. I do not plan to use Mahout because my requirements are very simple and mahout requires lot of work. I am looking for java based Canopy clustering implementation which i can plug into my algo and do parellel processing.
Mahout based Canopy libraries are coupled with Vectors and indexes and does not work on plain strings. If you know of the way, where i can use canopy clustering on strings using simple library, it would fix my issue.
My requirement is to pass list of strings(say 10K) to Canopy clustering algo and it should return sublists based on T1 and T2.
Canopy clustering is mostly useful as a preprocessing step for parallelization. I'm not sure how much it will get you on a single node. I figure you might as well compute the actual algorithm right away, or build an index such as an M-tree.
The strength of Canopy clustering is that you can run it independently on a number of nodes and then just overlap their results.
Also check if it actually is compatible to your approach. I figure that canopy might need metric properties to be correct. Is your string distance a proper metric (i.e. triangle inequality)?
10,000 data points, if that's all you're concerned with, should be no problem with standard k-means. I'd look at optimising that before you consider canopy clustering (which is really designed for millions or even billions of examples). Some things you may have missed:
pre-compute the feature vectors for each string. Don't do it every time you want to compare s_1 to s_2 or s_1 to cluster centroid
you only need to keep the summary statistics in memory: the sum of all points assigned to a cluster and the number of points assigned to a cluster. When you're done with an iteration, divides sums by ns and you have your new centroids.
what's the dimensionality of your feature space? be aware that you should use a distance metric where the dimensions where both vectors are zero have no impact, so you should only need to compute for non-zero dimensions. Store your points as sparse vectors to facilitate this.
Can you do some analysis and determine where the bottle-neck in your implementation is? I'm a little perplexed by your comment about Mahout not working with plain strings.
You should give the clustering algorithms in ELKI a try. Sorry for so shamelessly promoting a project I'm closely affiliated with. But it is the largest collection of clustering and outlier detection algorithms that are implemented in a comparable fashion. (If you'd take all the clustering algorithms available in some R package, you might end up with more algorithms, but they won't be really comparable because of implementation differences)
And benchmarking showed enormous speed differences with different implementations of the same algorithm. See our benchmarking web site on how much performance can vary even on simple algorithms such as k-means.
We do not yet have Canopy Clustering. The reason is that it's more of a preprocessing index than actually a clustering algorithm. Kind of like a primitive variant of the M-tree, or of DBSCAN clustering. However, we should would like to see a contributed canopy clustering as such a preprocessing step.
ELKIs abilities to process strings are also a bit limited so far. You can load typical TF-IDF vectors just fine and we have somewhat optimized sparse vector classes and similarity functions. They don't fully exploit sparsity for k-means yet, though, and there is no spherical k-means yet either. But there are various reasons why k-means results on sparse vectors cannot be expected to be very meaningful; it's more of a heuristic.
But it would be interesting if you could give it a try for your problem and report back your experiences. Was the performance somewhat competitive with your implementation? And we would love to see contributed modules for text processing, such as e.g. further optimized similarity functions, or a spherical k-means variant.
Update: ELKI now actually includes CanopyClustering: CanopyPreClustering (will be part of 0.6.0 then). But as of now, it's just another clustering algorithm, and not yet used to accelerate other algorithms such as k-means. I need to check how to best use it as some kind of index to accelerate algorithms. I can imagine it also helps for speeding up DBSCAN if you set T1=epsilon and T2=0.5*T1. The big issue with CanopyClustering IMHO is how to choose a good radius.

FFT and Music Comparison

I'm trying to play around with some music clustering algorithms, and I thought that using a feature vector consisting of basically a discretized fft (like discretize the frequencies) would be a good similarity measure. Would this even be useful? Do people know what some good audio similarity measures might be?
First of all, you need to decide whether you want fingerprinting (i.e. identity except for some distortion) or similarity (but not identity!) measures.
Also have a look at MFCC, bark scales and so on. There is plenty of literature out there. Go to Amazon, and grab a dedicated book on this topic.
You can use a hierarchical cluster like a kd-tree or a hilbert curve before you discretize. A cluster reduces the dimension complexity and change the order of the input while a fft just transform it to waves.

machine learning - svm feature fusion techique

for my final thesis i am trying to build up an 3d face recognition system by combining color and depth information. the first step i did, is to realign the data-head to an given model-head using the iterative closest point algorithm. for the detection step i was thinking about using the libsvm. but i dont understand how to combine the depth and the color information to one feature vector? they are dependent information (each point consist of color (RGB), depth information and also scan quality).. what do you suggest to do? something like weighting?
edit:
last night i read an article about SURF/SIFT features i would like to use them! could it work? the concept would be the following: extracting this features out of the color image and the depth image (range image), using each feature as a single feature vector for the svm?
Concatenation is indeed a possibility. However, as you are working on 3d face recognition you should have some strategy as to how you go about it. Rotation and translation of faces will be hard to recognize using a "straightforward" approach.
You should decide whether you attempt to perform a detection of the face as a whole, or of sub-features. You could attempt to detect rotation by finding some core features (eyes, nose, etc).
Also, remember that SVMs are inherently binary (i.e. they separate between two classes). Depending on your exact application you will very likely have to employ some multi-class strategy (One-against-all or One-against-many).
I would recommend doing some literature research to see how others have attacked the problem (a google search will be a good start).
It sounds simple, but you can simply concatenate the two vectors into one. Many researchers do this.
What you arrived at is an important open problem. Yes, there are some ways to handle it, as mentioned here by Eamorr. For example you can concatenate and do PCA (or some non linear dimensionality reduction method). But it is kind of hard to defend the practicality of doing so, considering that PCA takes O(n^3) time in the number of features. This alone might be unreasonable for data in vision that may have thousands of features.
As mentioned by others, the easiest approach is to simply combine the two sets of features into one.
SVM is characterized by the normal to the maximum-margin hyperplane, where its components specify the weights/importance of the features, such that higher absolute values have a larger impact on the decision function. Thus SVM assigns weights to each feature all on its own.
In order for this to work, obviously you would have to normalize all the attributes to have the same scale (say transform all features to be in the range [-1,1] or [0,1])

Resources