Clustering K-means algorithm for elongated data set - machine-learning

I have a question that came up while programming the K-means algorithm in Matlab: why is K-means not suitable for classifying elongated data sets?

In short: draw some thick lines on a piece of paper. Can you really represent each one with a single point? How would a single point convey any information about orientation?
K-means assigns each data point to its nearest centroid. That is, for each centroid c, every point whose distance to c is smaller than its distance to any other centroid is assigned to c. Since a (hyper)ball is exactly the set of points within some fixed distance of a center, it is easy to see why the resulting clusters tend to be spherical. (To be exact, k-means partitions the vector space into the Voronoi cells of the centroids.)
Elongated clusters, however, don't necessarily satisfy the requirement that all their points are closer to their own "center of mass" than to some other cluster's center.
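To make this concrete, here is a minimal sketch in Python with NumPy (my choice for illustration; the question itself uses Matlab). It runs a plain Lloyd-style k-means on two synthetic thin stripes. The `kmeans` helper, the stripe data and the restart scheme are all assumptions made for the demo, not anyone's reference implementation.

```python
import numpy as np

def kmeans(X, k, n_restarts=10, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns the labels of the lowest-SSE restart."""
    rng = np.random.default_rng(seed)
    best_sse, best_labels = np.inf, None
    for _ in range(n_restarts):
        # Initialise the centroids with k distinct data points.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each point goes to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid moves to the mean of its points.
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        sse = ((X - centroids[labels]) ** 2).sum()
        if sse < best_sse:
            best_sse, best_labels = sse, labels
    return best_labels

# Two long, thin horizontal stripes, 2 units apart (synthetic data).
rng = np.random.default_rng(42)
n = 200
stripe_a = np.column_stack([rng.uniform(-10, 10, n), rng.normal(0.0, 0.1, n)])
stripe_b = np.column_stack([rng.uniform(-10, 10, n), rng.normal(2.0, 0.1, n)])
X = np.vstack([stripe_a, stripe_b])
true_labels = np.repeat([0, 1], n)

labels = kmeans(X, k=2)
agreement = (labels == true_labels).mean()
agreement = max(agreement, 1 - agreement)  # cluster ids are arbitrary
print(f"agreement with the true stripes: {agreement:.2f}")
```

On data like this the lowest-error solution cuts both stripes in half (left versus right) instead of separating them, so the agreement should come out near chance level.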

It is difficult to choose initial cluster centers for an elongated data set, yet the choice has a strong effect on the result. You may get different results when you choose different initial points.
In this case you will get only one result when you choose 3 initial points:
But it is different for an elongated data set.


spectral clustering eigenvectors and eigenvalues

What do the eigenvalues and eigenvectors in spectral clustering physically mean? I see that if λ_0 = λ_1 = 0 then we have 2 connected components. But what do λ_2,...,λ_k tell us? I don't understand the algebraic connectivity by multiplicity.
Can we draw any conclusions about the tightness of a graph, or when comparing two graphs?
The smaller the eigenvalue, the less connected the graph is; 0 just means "disconnected".
Think of it as a measure of what share of edges you need to cut to produce separate components. The cut is orthogonal to the eigenvector: there is supposedly some threshold t such that nodes whose eigenvector entries fall below t go into one component, and those above t into the other.
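A small sketch of both statements, assuming the unnormalised Laplacian L = D - A and a toy graph of two triangles (the `laplacian` helper and the graph are made up for illustration). Without a bridge edge there are two zero eigenvalues; with one bridge the second-smallest eigenvalue is small but positive, and thresholding its eigenvector at t = 0 recovers the two triangles.

```python
import numpy as np

def laplacian(n_nodes, edges):
    """Unnormalised graph Laplacian L = D - A of an undirected graph."""
    A = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]

# Two separate triangles: two connected components.
eig_split = np.linalg.eigvalsh(laplacian(6, triangles))
n_zero = int(np.sum(np.isclose(eig_split, 0.0, atol=1e-9)))

# Same triangles joined by a single bridge edge (2, 3): barely connected.
eigvals, eigvecs = np.linalg.eigh(laplacian(6, triangles + [(2, 3)]))
algebraic_connectivity = eigvals[1]   # second-smallest eigenvalue
fiedler = eigvecs[:, 1]               # its eigenvector
side = fiedler > 0                    # threshold t = 0

print("zero eigenvalues (disconnected):", n_zero)
print("algebraic connectivity (bridged):", round(float(algebraic_connectivity), 4))
print("partition:", side.astype(int))
```

The sign of an eigenvector is arbitrary, so which triangle ends up labelled 1 can differ between runs or libraries; only the grouping matters.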
That depends somewhat on the algorithm. For several of the spectral algorithms, the eigenstuff can be easily run through Principal Component Analysis to reduce the display dimensionality for human consumption. Power iteration clustering vectors are more difficult to interpret.
As Mr.Roboto already noted, the eigenvector is normal to the division brane (a plane after a Gaussian kernel transformation). Spectral clustering methods are generally not sensitive to density (is that what you mean by "tightness"?) per se -- they find data gaps. For instance, it doesn't matter whether you have 50 or 500 nodes within a unit sphere forming your first cluster; the game changer is whether there's clear space (a nice gap) instead of a thin trail of "bread crumb" points (a sequence of tiny gaps) leading to another cluster.

Algorithm for selecting outer points on a graph ("rich" convex hull)

I'm looking for an efficient way of selecting a relatively large portion of points (2D Euclidean graph) that are the furthest away from the center. This resembles the convex hull, but would include (many) more points. Further criteria:
The number of points in the selection / set ("K") must be within a specified range. Most likely it won't be very narrow, but it must work for different ranges (e.g. 0.01*N < K < 0.05*N as well as 0.1*N < K < 0.2*N).
The algorithm must be able to balance distance from the center and "local density". If there are dense areas near the upper part of the graph range, but sparse areas near the lower part, then the algorithm must make sure to select some points from the lower part even if they are closer to the center than the points in the upper region. (See example below)
Bonus: rather than simple distance from center, taking into account distance to a specific point (or both a point and the center) would be perfect.
My attempts so far have focused on using "pigeon holing" (divide graph into CxR boxes, assign points to boxes based on coordinates) and selecting "outer" boxes until we have sufficient points in the set. However, I haven't been successful at balancing the selection (dense regions over-selected because of fixed box size) nor at using a selected point as reference instead of (only) the center.
I've (poorly) drawn an example: the red dots are the points, and the green shape is an example of what I want (outside the green = selected). For sparse regions, the bounding shape comes closer to the center to find suitable points (but doesn't necessarily find any, if they're too close to the center). The yellow box is an example of what my pigeon-holing-based algorithm does. Even when trying to adjust for sparser regions, it doesn't manage well.
Any and all ideas are welcome!
I don't think there are any standard algorithms that will give you what you want. You're going to have to get creative. Assuming your points are embedded in 2D Euclidean space here are some ideas:
Iteratively compute several convex hulls. For example, compute the convex hull, keep the points that are part of the convex hull, then compute another convex hull ignoring the points from the original convex hull. Continue to do this until you have a sufficient number of points, essentially plucking off points on the perimeter for each iteration. The only problem with this approach is that it will not work well for concavities in your data set (e.g., the one on the bottom of your sample you posted).
Fit a Gaussian to your data and keep everything > N standard deviations away from the mean (where N is a value that you'd have to choose). This should work pretty well if your data is Gaussian. If it isn't, you could always model it with several Gaussians (instead of one), and keep points with a joint probability less than some threshold. Using multiple Gaussians will probably handle concavities decently. References:
How to fit a gaussian to data in matlab/octave?
Use Kernel Density Estimation. If you create a kernel density surface, you could slice the surface at some height (i.e., turning it into a plateau), giving you a perimeter shape (the shape of the plateau) around the points. The trick is to slice it at the right height: with a poor choice you could end up with no points outside of the shape, but with the right choice (which may be difficult to find) you could get the green shape you drew. The big drawback of this approach is that it is very computationally expensive. More information:
Use alpha shapes to get a general shape that wraps tightly around the outside perimeter of the point set. Then erode the shape a little to force some points outside of the shape. I don't have a lot of experience with alpha shapes, but this approach will also be quite computationally expensive. More info:
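The first idea above (repeatedly peeling off convex hulls) is the easiest to sketch. The version below is pure Python and uses Andrew's monotone chain for the hull, which is my choice and not something this answer prescribes; `peel_outer_points` and `k_min` are hypothetical names, and points are assumed to be `(x, y)` tuples.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); > 0 means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def peel_outer_points(points, k_min):
    """Peel convex-hull layers until at least k_min points are selected."""
    remaining = set(points)
    selected = []
    while remaining and len(selected) < k_min:
        layer = convex_hull(list(remaining))
        selected.extend(layer)
        remaining.difference_update(layer)
    return selected

points = [(0, 0), (4, 0), (4, 4), (0, 4),      # outer square
          (1, 1), (3, 1), (3, 3), (1, 3),      # inner square
          (2, 2)]                              # centre
print(peel_outer_points(points, k_min=4))      # one layer is enough
print(peel_outer_points(points, k_min=5))      # needs a second layer
```

Note that the selection grows a whole layer at a time, so it can overshoot `k_min`; to hit an upper bound on K you would have to trim the last layer somehow. It also inherits the concavity problem mentioned above.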

Why do we maximize variance during Principal Component Analysis?

I'm reading up on PCA and saw that the objective is to maximize the variance. I don't quite understand why. Any explanation of other related topics would be helpful.
Variance is a measure of the "variability" of the data you have. Potentially the number of components is infinite (actually, once the data is in a numerical matrix, it is at most equal to the rank of that matrix, as @jazibjamil pointed out), so you want to "squeeze" the most information into each component of the finite set you build.
If, to exaggerate, you were to select a single principal component, you would want it to account for the most variability possible: hence the search for maximum variance, so that the one component collects the most "uniqueness" from the data set.
Note that PCA does not actually increase the variance of your data. Rather, it rotates the data set in such a way as to align the directions in which it is spread out the most with the principal axes. This enables you to remove those dimensions along which the data is almost flat. This decreases the dimensionality of the data while keeping the variance (or spread) among the points as close to the original as possible.
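A quick numerical check of that point, as a sketch with NumPy on synthetic 2D data (the data and all variable names are made up). PCA is done here via an eigendecomposition of the covariance matrix, which is one common way to compute it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2D data, stretched along the diagonal.
z = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = z @ R.T

Xc = X - X.mean(axis=0)                    # centre the data
cov = np.cov(Xc, rowvar=False)             # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]          # re-sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                      # data rotated onto the principal axes
var_original = Xc.var(axis=0, ddof=1)
var_rotated = scores.var(axis=0, ddof=1)

print("variance per original axis: ", var_original)
print("variance per principal axis:", var_rotated)
print("share kept by first component:", eigvals[0] / eigvals.sum())
```

The total variance is the same before and after the rotation; what changes is that almost all of it ends up on the first axis, so the second one can be dropped with little loss.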
Maximizing the component vector variances is the same as maximizing the 'uniqueness' of those vectors. Thus your vectors are as distant from each other as possible. That way, if you only use the first N component vectors, you're going to capture more space with highly varying vectors than with like vectors. Think about what Principal Component actually means.
Take for example a situation where you have 2 lines that are orthogonal in a 3D space. You can capture the environment much more completely with those orthogonal lines than 2 lines that are parallel (or nearly parallel). When applied to very high dimensional states using very few vectors, this becomes a much more important relationship among the vectors to maintain. In a linear algebra sense you want independent rows to be produced by PCA, otherwise some of those rows will be redundant.
See this PDF from Princeton's CS Department for a basic explanation.
Maximum variance basically means choosing the axes along which the data points are spread out the most. Why? Because the direction of these axes is what really matters: it more or less explains the correlations, and later on we compress/project the points along those axes to get rid of some dimensions.

Features for gesture recognition

I would like to create an application which can learn to classify a sequence of points drawn by a user, e.g. something like handwriting recognition. If the data point consists of a number of (x,y) pairs (like the pixels corresponding to a gesture instance), what are the best features to compute about the instance which would make for a good multi-class classifier (e.g. SVM, NN, etc)? Particularly if there are limited training examples provided.
If I were you, I would find the data points that correspond with corners, end points and intersections, use those as features and discard the intermediate points. You could include the angle or some other descriptor of these interest points as well.
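As a rough sketch of that idea for a stroke given as an ordered list of (x, y) points: mark the two end points, plus every interior point where the direction of travel turns sharply. This is a simple turning-angle heuristic of my own choosing, not an image-based corner detector; `interest_points` and the 45-degree threshold are hypothetical, and intersections are not handled.

```python
import math

def turning_angles(stroke):
    """Absolute turning angle (degrees) at each interior point of a polyline."""
    angles = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(stroke, stroke[1:], stroke[2:]):
        a_in = math.atan2(y1 - y0, x1 - x0)     # heading of incoming segment
        a_out = math.atan2(y2 - y1, x2 - x1)    # heading of outgoing segment
        diff = (a_out - a_in + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
        angles.append(abs(math.degrees(diff)))
    return angles

def interest_points(stroke, corner_threshold_deg=45.0):
    """Return (index, kind, angle) for the end points and sharp corners."""
    feats = [(0, "end", 0.0)]
    for i, ang in enumerate(turning_angles(stroke), start=1):
        if ang >= corner_threshold_deg:
            feats.append((i, "corner", ang))
    feats.append((len(stroke) - 1, "end", 0.0))
    return feats

# An "L"-shaped stroke: right along the x-axis, then straight up.
stroke = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(interest_points(stroke))
```

Real strokes are noisy, so you would probably want to resample or smooth the points first; otherwise tiny jitters show up as corners.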
For detecting interest points you could use a Harris detector, and then use the gradient value at each point as a simple descriptor. Alternatively, you could go with a fancier method like SIFT.
You could use the descriptor of every pixel in your downsampled image and then classify with SVM. The disadvantage of that is that there would be a large amount of uninteresting data points in the feature vector.
An alternative would be to approach it not as a classification problem but as a template matching problem (fairly common in computer vision). In this case a gesture can be specified as an arbitrary number of interest points, completely leaving out the non-interesting data. A certain threshold percentage of an instance's points has to match a template for a positive identification. For example, when matching the corner points of an instance of 'R' against the template for 'X', the bottom right points should match, being end points in the same position and orientation, but the others are too dissimilar, giving a fairly low score, so the identification R=X will be rejected.

Selecting an appropriate similarity metric & assessing the validity of a k-means clustering model

I have implemented k-means clustering to determine the clusters in 300 objects. Each of my objects has about 30 dimensions. The distance is calculated using the Euclidean metric.
I need to know:
How would I determine if my algorithm works correctly? I can't have a graph which will give some idea about the correctness of my algorithm.
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?
The two questions in the OP are separate topics (i.e., no overlap in the answers), so I'll try to answer them one at a time, starting with item 1 on the list.
How would I determine if my [clustering] algorithm works correctly?
k-means, like other unsupervised ML techniques, lacks a good selection of diagnostic tests to answer questions like "are the cluster assignments returned by k-means more meaningful for k=3 or k=5?"
Still, there is one widely accepted test that yields intuitive results and that is straightforward to apply. This diagnostic metric is just this ratio:
inter-centroidal separation / intra-cluster variance
As the value of this ratio increases, the quality of your clustering result increases.
This is intuitive. The first of these metrics just measures how far apart each cluster is from the others (measured between the cluster centers).
But inter-centroidal separation alone doesn't tell the whole story, because two clustering algorithms could return results having the same inter-centroidal separation though one is clearly better, because the clusters are "tighter" (i.e., smaller radii); in other words, the cluster edges have more separation. The second metric--intra-cluster variance--accounts for this. This is just the mean variance, calculated per cluster.
In sum, the ratio of inter-centroidal separation to intra-cluster variance is a quick, consistent, and reliable technique for comparing results from different clustering algorithms, or to compare the results from the same algorithm run under different variable parameters--e.g., number of iterations, choice of distance metric, number of centroids (value of k).
The desired result is tight (small) clusters, each one far away from the others.
The calculation is simple:
For inter-centroidal separation:
calculate the pair-wise distance between cluster centers; then
calculate the median of those distances.
For intra-cluster variance:
for each cluster, calculate the distance of every data point in a given cluster from its cluster center; next
(for each cluster) calculate the variance of the sequence of distances from the step above; then
average these variance values.
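The steps above can be sketched in a few lines of Python with NumPy. The function name is hypothetical, cluster centers are taken to be the per-cluster means, and I use the population variance (`np.var` with its default `ddof=0`), which is an assumption since the recipe doesn't say which variance to use.

```python
import numpy as np
from itertools import combinations

def separation_to_variance_ratio(X, labels):
    """Median inter-centroid distance divided by mean intra-cluster variance."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    clusters = np.unique(labels)
    centers = np.array([X[labels == c].mean(axis=0) for c in clusters])

    # Inter-centroidal separation: median of pairwise distances between centers.
    pair_dists = [np.linalg.norm(a - b) for a, b in combinations(centers, 2)]
    separation = np.median(pair_dists)

    # Intra-cluster variance: variance of the point-to-center distances
    # per cluster, averaged over the clusters.
    variances = [
        np.var(np.linalg.norm(X[labels == c] - center, axis=1))
        for c, center in zip(clusters, centers)
    ]
    return separation / np.mean(variances)

X = [(-1, 0), (1, 0), (0, 2), (0, -2),      # cluster 0, centered at (0, 0)
     (9, 0), (11, 0), (10, 2), (10, -2)]    # cluster 1, centered at (10, 0)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
ratio = separation_to_variance_ratio(X, labels)
print(ratio)
```

It needs at least two clusters, and it divides by zero if every point in every cluster sits at exactly the same distance from its center, so guard against that on degenerate data.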
That's my answer to the first question. Here's the second question:
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30 ?
First, the easy question--is Euclidean distance a valid metric as dimensions/features increase?
Euclidean distance is perfectly scalable--works for two dimensions or two thousand. For any pair of data points:
subtract their feature vectors element-wise,
square each item in that result vector,
sum that result,
take the square root of that scalar.
Nowhere in this sequence of calculations is scale implicated.
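Spelled out in plain Python, with one line per step (the function name is mine):

```python
import math

def euclidean(u, v):
    diffs = [a - b for a, b in zip(u, v)]      # subtract element-wise
    squares = [d * d for d in diffs]           # square each item
    total = sum(squares)                       # sum that result
    return math.sqrt(total)                    # square root of the scalar

print(euclidean([0, 0], [3, 4]))               # two dimensions
print(euclidean([0.0] * 2000, [1.0] * 2000))   # two thousand dimensions
```

The same function handles both calls; the number of dimensions only changes the length of the lists.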
But whether Euclidean distance is the appropriate similarity metric for your problem depends on your data. For instance, is it purely numeric (continuous)? Or does it have discrete (categorical) variables as well (e.g., gender: M/F)? If one of your dimensions is "current location" and of the 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you can't really say that, on average, your users are from somewhere in Kansas, but that's sort of what Euclidean distance would do.
In any event, since we don't know anything about your data, I'll just give you a simple flow diagram so that you can apply it to your data and identify an appropriate similarity metric.
To identify an appropriate similarity metric given your data:
Euclidean distance is good when the dimensions are comparable and on the same scale. If one dimension represents length and another the weight of an item, plain Euclidean distance should be replaced with a weighted one.
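One common way to get such a weighting, sketched here as an assumption and not as the only option, is to standardize each dimension (zero mean, unit standard deviation) before taking plain Euclidean distances. The item values below are made up.

```python
import numpy as np

def standardize(X):
    """Rescale every column to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Hypothetical items: length in millimetres, weight in kilograms.
items = np.array([[1000.0, 1.0],
                  [1010.0, 9.0],
                  [2000.0, 1.2]])

raw_01 = np.linalg.norm(items[0] - items[1])
raw_02 = np.linalg.norm(items[0] - items[2])

Z = standardize(items)
scaled_01 = np.linalg.norm(Z[0] - Z[1])
scaled_02 = np.linalg.norm(Z[0] - Z[2])

print("raw:   ", raw_01, raw_02)
print("scaled:", scaled_01, scaled_02)
```

On the raw numbers the length column dominates, so the big weight difference between items 0 and 1 barely registers; after scaling, both differences count about equally. A constant column would cause a division by zero here and needs special handling.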
Try it in 2D and plot the picture; this is a good way to see visually whether it works.
Or you may use a sanity check, such as finding the cluster centers and checking that no item in a cluster is too far away from its center.
Can't you just try sum |xi - yi| instead of sum (xi - yi)^2 in your code, and see if it makes much difference?
I can't have a graph which will give some idea about the correctness of my algorithm.
A couple of possibilities:
look at some points midway between 2 clusters in detail
vary k a bit, see what happens (what is your k ?)
to map 30d down to 2d; see the plots under
also SO questions/tagged/pca
By the way, scipy.spatial.cKDTree
can easily give you say 3 nearest neighbors of each point,
in p=2 (Euclidean) or p=1 (Manhattan, L1), to look at.
It's fast up to ~ 20d, and with early cutoff works even in 128d.
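For example, a minimal sketch with random stand-in data of the same shape as in the question (300 objects, 30 dimensions); the variable names are arbitrary:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))      # stand-in for 300 objects with 30 dimensions

tree = cKDTree(X)
# k=4 because each point's nearest neighbour is itself (distance 0).
dist_l2, idx_l2 = tree.query(X, k=4, p=2)   # Euclidean
dist_l1, idx_l1 = tree.query(X, k=4, p=1)   # Manhattan, L1

neighbours_l2 = idx_l2[:, 1:]       # drop the self-match in column 0
neighbours_l1 = idx_l1[:, 1:]
same = (np.sort(neighbours_l2, axis=1) == np.sort(neighbours_l1, axis=1)).all(axis=1).mean()
print("share of points with identical 3-NN sets under L1 and L2:", same)
```

If the two metrics mostly agree on who the neighbours are, the choice between them probably won't change your clustering much.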
Added: I like Cosine distance in high dimensions; see euclidean-distance-is-usually-not-good-for-sparse-data for why.
Euclidean distance is the intuitive and "normal" distance between continuous variables. It can be inappropriate if the data is too noisy or has a non-Gaussian distribution.
You might want to try the Manhattan distance (or cityblock), which is robust to that (bear in mind that robustness always comes at a cost: a bit of the information is lost, in this case).
There are many further distance metrics for specific problems (for example, Bray-Curtis distance for count data). You might want to try some of the distances implemented in pdist from the Python module scipy.spatial.distance.
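A tiny sketch of what that looks like, on three made-up points:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

for metric in ("euclidean", "cityblock", "braycurtis"):
    d = pdist(X, metric=metric)       # condensed vector: pairs (0,1), (0,2), (1,2)
    print(metric, d)

D = squareform(pdist(X, metric="cityblock"))   # full symmetric 3x3 matrix
print(D)
```

Swapping the `metric` string is all it takes to compare metrics, so it is cheap to try a few and see how much your clusters change.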
