Is kmeans repeatable? - machine-learning

I wanted to know if we get roughly the same centroid points for the exact same data set given that the initial centroid points are chosen randomly.
I'm writing a test kmeans program, and they don't seem to match. I wanted to know if what I'm doing is right.

The k-means algorithm requires some initialization of the centroid positions. In most implementations, these centroids are randomly initialized with a method such as the Forgy method or random partitioning, which means that repeated runs of the algorithm can converge to vastly different results.
Remember that k-means is iterative, and at each "move centroid" step each centroid is moved to the mean of its currently assigned points, which minimizes the sum of squared distances to them. This makes the final result heavily dependent on the starting positions.
Because of this, it's usually advisable to run k-means several times, and select the clustering that minimizes the error.
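As an illustration (scikit-learn is my choice here, not something the question mentions), this is exactly what the n_init parameter of sklearn's KMeans does, and it is easy to write out by hand:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)  # toy data standing in for your own

    # n_init=10 runs k-means from 10 different random initializations and
    # keeps the run with the lowest within-cluster sum of squares (inertia).
    best = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # The same idea written out explicitly:
    runs = [KMeans(n_clusters=3, n_init=1, random_state=s).fit(X) for s in range(10)]
    best_manual = min(runs, key=lambda km: km.inertia_)
    print(best.inertia_, best_manual.inertia_)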

No, it is not guaranteed.
Consider a simple case of 2-means with 4 points: (1, 1), (-1, 1), (1, -1), (-1, -1) (a square in a 2D plane)
then the 2 centroids may be {(0, 1), (0, -1)} or {(1, 0), (-1, 0)}, two very different results.
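A quick sketch of that example (scikit-learn is an assumption of mine, not part of the answer): two different starting positions converge to the two different, equally valid 2-means solutions.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]], dtype=float)

    # Start with centroids above/below the origin vs. left/right of it.
    top_bottom = KMeans(n_clusters=2, n_init=1,
                        init=np.array([[0.0, 0.5], [0.0, -0.5]])).fit(X)
    left_right = KMeans(n_clusters=2, n_init=1,
                        init=np.array([[0.5, 0.0], [-0.5, 0.0]])).fit(X)

    print(top_bottom.cluster_centers_)  # ~ {(0, 1), (0, -1)}
    print(left_right.cluster_centers_)  # ~ {(1, 0), (-1, 0)}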

Many k-means implementations allow fixing the random number generator to make results reproducible.
ELKI: -kmeans.seed parameter
Weka: -s parameter
In others, you can usually provide the initial centers yourself, choosing them with your own reproducibly seeded pseudo-random generator.
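For example, in scikit-learn (my example; the answer only names ELKI and Weka) you can either fix the seed or pass the initial centers directly:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(42)
    X = rng.randn(300, 2)  # toy data for illustration

    # Option 1: fix the random seed, so repeated runs return identical centroids.
    a = KMeans(n_clusters=3, random_state=42).fit(X)
    b = KMeans(n_clusters=3, random_state=42).fit(X)
    assert np.allclose(a.cluster_centers_, b.cluster_centers_)

    # Option 2: choose the initial centers yourself with a seeded RNG.
    init = X[rng.choice(len(X), size=3, replace=False)]
    c = KMeans(n_clusters=3, init=init, n_init=1).fit(X)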

Related

Is it possible to use libsvm for multilabel regression problems

I have a scenario where I need to predict spherical coordinates (r, theta, phi) depending upon the values of 6 attributes. I am using Libsvm with the regression option. If I individually predict each label per object instance, it doesn't make sense. Also, if I combine the labels and assign a specific label for each (r, theta, phi), it is not meaningful and the SVM does not converge in prediction. I want the SVM to analyse the combination of the three coordinates and create a training model accordingly. Is it possible? Please advise.
Not really: SVM is a classification algorithm, not a prediction algorithm. So far as SVM is concerned, a label of (0, 0, 1) is just as distinct from (0, 0, 2) as it is from (20, 3, -1): "not the same".
If you have a regression problem, then use a regression model: do a little research to find one that matches whatever your data set characteristics suggest.
UPDATE PER OP COMMENT
From what little you've said, it sounds to me as if you want a multivariate regression, with a single loss function that describes the deviation from the desired output triple. You're correct that three separate regressions won't work for this scenario: the position in space depends on a non-linear combination of the three outputs.
I suggest that you make your loss function some useful distance function between the true and predicted positions. You will need to experiment with your model features, using linear, squared, and other terms for each of the six inputs. I can't suggest anything, as you haven't adequately described the problem.
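As a rough sketch of that idea (the linear model and all names below are my own illustration, not something the asker described): fit one set of parameters under a single loss that measures the distance between the true and predicted positions, here by converting both (r, theta, phi) triples to Cartesian coordinates first.

    import numpy as np
    from scipy.optimize import minimize

    def to_cartesian(r, theta, phi):
        # Spherical (r, theta, phi) -> Cartesian (x, y, z)
        return np.stack([r * np.sin(theta) * np.cos(phi),
                         r * np.sin(theta) * np.sin(phi),
                         r * np.cos(theta)], axis=-1)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 6))        # the 6 input attributes (toy data)
    Y = X @ rng.normal(size=(6, 3))      # toy (r, theta, phi) targets

    def loss(w_flat):
        W = w_flat.reshape(6, 3)
        pred = X @ W                                   # predicted (r, theta, phi)
        diff = to_cartesian(*pred.T) - to_cartesian(*Y.T)
        return np.mean(np.linalg.norm(diff, axis=1))   # one geometric loss

    result = minimize(loss, np.zeros(18), method="L-BFGS-B")
    W_fit = result.x.reshape(6, 3)

Whether the model is linear or something richer is secondary; the point is that the three outputs share a single geometric loss.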

Clustering K-means algorithm for elongated data set

I have a question that came up while programming the k-means algorithm in Matlab. Why is the k-means algorithm not suitable for clustering an elongated data set?
In short, draw some thick lines on a piece of paper. Can you really represent each one with a single point? How would single points give any information about orientation?
K-means assigns each data point to its nearest centroid. That is to say, for each centroid c, all points whose distance to c is smaller than their distance to every other centroid are assigned to c. And since a (hyper)ball is, in fact, the set of all points within some distance of a center, it is easy to see why the resulting clusters tend to be spherical. (To be exact, k-means effectively creates a Voronoi diagram of the centroids in the vector space.)
Elongated clusters, however, don't necessarily satisfy the requirement that all of their points are closer to their own "center of mass" than to some other cluster's center.
It is also difficult to choose good initial cluster centers in an elongated data set, and that choice has a strong effect on the result: you may get different clusterings when you choose different starting points.
With a compact, well-separated data set you will get essentially the same result no matter which 3 initial points you choose, but that is not the case for an elongated data set.
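A small illustration of this point (in Python with scikit-learn rather than the asker's Matlab; the data are synthetic): two long, thin, parallel clusters, which k-means prefers to split left/right rather than top/bottom, because that cut gives a lower within-cluster sum of squares.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    n = 200
    # Two long, thin horizontal stripes, one just above the other.
    upper = np.column_stack([rng.uniform(-10, 10, n), rng.normal(+1.0, 0.1, n)])
    lower = np.column_stack([rng.uniform(-10, 10, n), rng.normal(-1.0, 0.1, n)])
    X = np.vstack([upper, lower])
    true_labels = np.array([0] * n + [1] * n)

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Agreement with the true stripes, allowing for label permutation:
    agree = max(np.mean(labels == true_labels), np.mean(labels != true_labels))
    print(f"fraction assigned to the correct stripe: {agree:.2f}")  # ~0.5, i.e. it fails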

Using an ANN to calculate a position vector's length and the angle between it and the x-axis

I'm new to neural networks and trying to get the hang of it by solving the following task:
Given a semi circle which defines an area above the x-axis, I would like to teach an ANN to output the length of a vector pointing to any position within that area. In addition, I would also like to know the angle between it and the x-axis.
I thought of this as a classical example of supervised learning and used backpropagation to train a feed-forward network. The network is built from two input neurons, two output neurons, and a variable number of hidden neurons organised in a variable number of hidden layers.
My training data is a random and unsorted sample of points within that area and the respective desired values. The coordinates of the points serve as the input of the net while I use the calculated values to minimise the error.
However, even after thousands of training iterations and empirical changes to the network's topology, I am unable to produce results with an error below ~0.2 (Radius: 20.0, Topology: 2/4/2).
Are there any obvious pitfalls I'm failing to see or does the chosen approach just not fit the task? Which other network types and/or learning techniques could be used to complete the task?
I wouldn't use a variable number of hidden layers; I would use just one.
Then, I wouldn't use two output neurons, I would use two separate ANNs, one for each of the values you're after. This should do better, since your outputs aren't clearly related in my opinion.
Then, I would experiment with number of hidden neurons between 2 and 10 and different activation functions (logistic and tanh, maybe ReLUs).
After that, do you scale your data? It might be worth scaling both your inputs and outputs. Sigmoid units return small numbers, so it is good if you can adapt your outputs to be small as well (in [-1, 1] or [0, 1]). For example, if you want your angles in degrees, divide all of your targets by 360 before training the ANN on them. Then when the ANN returns a result, multiply it by 360 and see if that helps.
Finally, there are a number of ways to train your neural network. Gradient descent is the classic, but probably not the best. Better methods are conjugate gradient, BFGS etc. See here for optimizers if you're using python - even if not, they might give you an idea of what to search for in your language.
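Putting several of those suggestions together in one rough sketch (scikit-learn's MLPRegressor is my choice of tool, not the asker's setup): one hidden layer, tanh units, scaled inputs, targets scaled into [0, 1], and one network per output.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    R = 20.0
    rng = np.random.default_rng(0)

    # Random points in the half-disc of radius R above the x-axis (toy training set).
    r = R * np.sqrt(rng.uniform(0, 1, 2000))
    a = rng.uniform(0, np.pi, 2000)
    X = np.column_stack([r * np.cos(a), r * np.sin(a)]) / R  # inputs scaled into [-1, 1]

    y_len = r / R      # target 1: vector length, scaled into [0, 1]
    y_ang = a / np.pi  # target 2: angle, scaled into [0, 1]

    net_len = MLPRegressor(hidden_layer_sizes=(8,), activation="tanh",
                           solver="lbfgs", max_iter=2000).fit(X, y_len)
    net_ang = MLPRegressor(hidden_layer_sizes=(8,), activation="tanh",
                           solver="lbfgs", max_iter=2000).fit(X, y_ang)

    # Undo the scaling when reading the predictions back out.
    p = np.array([[5.0, 5.0]]) / R
    print(net_len.predict(p) * R)        # length (should be roughly 7.07)
    print(net_ang.predict(p) * 180.0)    # angle in degrees (should be roughly 45)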

How is a homography calculated?

I am having quite a bit of trouble understanding the workings of plane to plane homography. In particular I would like to know how the opencv method works.
Is it like ray tracing? How does a homogeneous coordinate differ from a scale*vector?
Everything I read talks like you already know what they're talking about, so it's hard to grasp!
Googling homography estimation returns this as the first link (at least to me):
http://cseweb.ucsd.edu/classes/wi07/cse252a/homography_estimation/homography_estimation.pdf. It is, frankly, a poor description, and a lot has been omitted. If you want to learn these concepts, reading a good book like Multiple View Geometry in Computer Vision would be far better than reading short articles, which often contain several serious mistakes, so be careful.
In short, a cost function is defined, and the parameters (the elements of the homography matrix) that minimize this cost function are the answer we are looking for. A meaningful cost function is geometric, that is, it has a geometric interpretation. For the homography case, we want to find H such that, when transforming points from one image to the other, the distance between all the points and their correspondences is minimal. This geometric cost function is nonlinear, which means that (1) an iterative method must, in general, be used to solve it, and (2) the iterative method needs an initial starting point.
This is where algebraic cost functions enter. These cost functions have no meaningful geometric interpretation. Designing them is often more of an art, and for a given problem you can usually find several algebraic cost functions with different properties. The benefit of algebraic costs is that they lead to linear optimization problems, hence a closed-form solution exists (that is, a one-shot, non-iterative method). The downside is that the found solution is not optimal. Therefore, the general approach is to first optimize an algebraic cost and then use the result as the starting point for an iterative geometric optimization. If you now google for these cost functions for homography you will find how they are usually defined.
In case you want to know which method is used in OpenCV, you simply need to have a look at the code:
http://code.opencv.org/projects/opencv/repository/entry/trunk/opencv/modules/calib3d/src/fundam.cpp#L81
This is the algebraic cost function, DLT (Direct Linear Transform), defined in the mentioned book; if you google "homography DLT" you should find some relevant documents. And then here:
http://code.opencv.org/projects/opencv/repository/entry/trunk/opencv/modules/calib3d/src/fundam.cpp#L165
An iterative procedure minimizes the geometric cost function. It seems the Gauss-Newton method is implemented:
http://en.wikipedia.org/wiki/Gauss%E2%80%93Newton_algorithm
All the above discussion assumes you have correspondences between the two images. If some points are matched to incorrect points in the other image, then you have outliers, and the results of the mentioned methods would be completely off. This is where robust (against outliers) methods enter. OpenCV gives you two options: 1. RANSAC, 2. LMedS. Google is your friend here.
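As a brief sketch of that entry point (the correspondences below are synthetic; only cv2.findHomography and its RANSAC/LMedS options come from OpenCV itself):

    import numpy as np
    import cv2

    rng = np.random.default_rng(0)
    src = rng.uniform(0, 100, (50, 2)).astype(np.float32)

    # Build correspondences from a known homography, then corrupt a few of them.
    H_true = np.array([[1.0, 0.1, 5.0],
                       [0.0, 1.2, -3.0],
                       [1e-3, 0.0, 1.0]])
    dst_h = np.hstack([src, np.ones((50, 1))]) @ H_true.T
    dst = (dst_h[:, :2] / dst_h[:, 2:]).astype(np.float32)
    dst[:5] += 40.0                      # five bad matches acting as outliers

    # Robust estimation: RANSAC picks inliers, then H is refined on them.
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    print(H / H[2, 2])
    print(int(inlier_mask.sum()), "matches kept as inliers")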
Hope that helps.
To answer your question we need to address 4 different questions:
1. Define homography.
2. See what happens when noise or outliers are present.
3. Find an approximate solution.
4. Refine it.
A homography is a 3x3 matrix that maps 2D points. The mapping is linear in homogeneous coordinates: [x2, y2, 1]’ ~ H * [x1, y1, 1]’, where ’ means transpose (to write column vectors as rows) and ~ means that the mapping holds only up to scale. It is easier to see in Cartesian coordinates (multiplying numerator and denominator by the same factor doesn’t change the result):
x2 = (h11*x1 + h12*y1 + h13)/(h31*x1 + h32*y1 + h33)
y2 = (h21*x1 + h22*y1 + h23)/(h31*x1 + h32*y1 + h33)
You can see that in Cartesian coordinates the mapping is non-linear, but for now just keep this in mind.
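A tiny numeric illustration of those two views (the numbers are chosen arbitrarily): multiply in homogeneous coordinates, then divide by the third component to get the Cartesian formulas above, and note that rescaling H changes nothing.

    import numpy as np

    H = np.array([[1.0, 0.2, 3.0],
                  [0.1, 1.5, -2.0],
                  [0.001, 0.002, 1.0]])

    x1, y1 = 4.0, 7.0
    p = H @ np.array([x1, y1, 1.0])     # homogeneous result, defined up to scale
    x2, y2 = p[0] / p[2], p[1] / p[2]   # same as the Cartesian formulas above

    # Scaling H by any nonzero factor gives the same mapping:
    q = (5.0 * H) @ np.array([x1, y1, 1.0])
    assert np.allclose([x2, y2], [q[0] / q[2], q[1] / q[2]])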
We can easily solve the former set of linear equations in homogeneous coordinates using least-squares linear algebra methods (see DLT - Direct Linear Transform), but this unfortunately only minimizes an algebraic error in the homography parameters. People care more about another kind of error - namely the error that shifts points around in Cartesian coordinate systems. If there is no noise and there are no outliers, the two errors can be identical. However, the presence of noise requires us to minimize the residuals in Cartesian coordinates (residuals are just squared differences between the left and right sides of the Cartesian equations). On top of that, the presence of outliers requires us to use a robust method such as RANSAC. It selects the best set of inliers and rejects the outliers to make sure they don't contaminate our solution.
Since RANSAC finds correct inliers by random trial and error over many iterations, we need a really fast way to compute a homography, and this is the linear approximation that minimizes the error in the parameters (the wrong metric) but is otherwise close enough to the final solution (which minimizes the squared point-coordinate residuals - the right metric). We use the linear solution as a guess for further non-linear optimization.
The final step is to use our initial guess (the solution of the linear system that minimized an error in the homography parameters) to solve the non-linear equations (which minimize a sum of squared pixel errors). The reason to use squared residuals instead of, say, their absolute values is that in the Gaussian formula (which describes the noise) we have a squared exponent, exp(-(x-mu)^2/(2*sigma^2)), so (skipping some probability formulas) the maximum-likelihood solution requires squared residuals.
In order to perform a non-linear optimization one typically employs the Levenberg-Marquardt method. But as a first approximation one can just use gradient descent (note that the gradient points uphill, but we are looking for a minimum, so we go against it, hence the minus sign below). In a nutshell, we go through a set of iterations 1..t..N, selecting the homography parameters at iteration t as param(t) = param(t-1) - k * gradient, where gradient = d_cost/d_param.
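A compact sketch of that refinement step (my own, using scipy.optimize.least_squares, whose "lm" method implements Levenberg-Marquardt): start from the linear estimate and minimize the squared point-transfer residuals.

    import numpy as np
    from scipy.optimize import least_squares

    def residuals(h, src, dst):
        H = h.reshape(3, 3)
        p = np.hstack([src, np.ones((len(src), 1))]) @ H.T
        proj = p[:, :2] / p[:, 2:]        # Cartesian projection of each source point
        return (proj - dst).ravel()       # per-coordinate reprojection errors

    def refine_homography(H0, src, dst):
        """src, dst: N x 2 matched points; H0: linear (e.g. DLT) initial estimate."""
        fit = least_squares(residuals, H0.ravel(), args=(src, dst), method="lm")
        H = fit.x.reshape(3, 3)
        return H / H[2, 2]                # remove the arbitrary overall scale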
Bonus material: to further reduce the noise in your homography you can try a few tricks: reduce the search space for points (start tracking your points); use different features (lines, conics, etc. that are also transformed by a homography but possibly have a higher SNR); reject impossible homographies to speed up RANSAC (e.g. those that correspond to ‘impossible’ point movements); use a low-pass filter for small changes in homographies that may be attributed to noise.

Selecting an appropriate similarity metric & assessing the validity of a k-means clustering model

I have implemented k-means clustering for determining the clusters in 300 objects. Each of my objects has about 30 dimensions. The distance is calculated using the Euclidean metric.
I need to know:
How would I determine if my algorithm works correctly? I can't have a graph which will give some idea about the correctness of my algorithm.
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?
The two questions in the OP are separate topics (i.e., no overlap in the answers), so I'll try to answer them one at a time, starting with item 1 on the list.
How would I determine if my [clustering] algorithms works correctly?
k-means, like other unsupervised ML techniques, lacks a good selection of diagnostic tests to answer questions like "are the cluster assignments returned by k-means more meaningful for k=3 or k=5?"
Still, there is one widely accepted test that yields intuitive results and that is straightforward to apply. This diagnostic metric is just this ratio:
inter-centroidal separation / intra-cluster variance
As the value of this ratio increases, the quality of your clustering result increases.
This is intuitive. The first of these metrics just measures how far apart each cluster is from the others (measured between the cluster centers).
But inter-centroidal separation alone doesn't tell the whole story, because two clustering algorithms could return results having the same inter-centroidal separation though one is clearly better, because the clusters are "tighter" (i.e., smaller radii); in other words, the cluster edges have more separation. The second metric--intra-cluster variance--accounts for this. This is just the mean variance, calculated per cluster.
In sum, the ratio of inter-centroidal separation to intra-cluster variance is a quick, consistent, and reliable technique for comparing results from different clustering algorithms, or to compare the results from the same algorithm run under different variable parameters--e.g., number of iterations, choice of distance metric, number of centroids (value of k).
The desired result is tight (small) clusters, each one far away from the others.
The calculation is simple (a short code sketch follows these steps):
For inter-centroidal separation:
calculate the pair-wise distances between cluster centers; then
calculate the median of those distances.
For intra-cluster variance:
for each cluster, calculate the distance of every data point in that cluster from its cluster center; next
(for each cluster) calculate the variance of the sequence of distances from the step above; then
average these variance values.
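Here is a direct sketch of that recipe (scikit-learn is used only to produce labels and centers for some placeholder data; the metric itself is plain numpy/scipy):

    import numpy as np
    from scipy.spatial.distance import pdist
    from sklearn.cluster import KMeans

    def separation_to_variance_ratio(X, labels, centers):
        # inter-centroidal separation: median pairwise distance between centers
        inter = np.median(pdist(centers))
        # intra-cluster variance: variance of point-to-center distances,
        # computed per cluster and then averaged over clusters
        variances = [np.linalg.norm(X[labels == k] - c, axis=1).var()
                     for k, c in enumerate(centers)]
        return inter / np.mean(variances)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 30))     # 300 objects, 30 dimensions (placeholder data)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(separation_to_variance_ratio(X, km.labels_, km.cluster_centers_))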
That's my answer to the first question. Here's the second question:
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30 ?
First, the easy question--is Euclidean distance a valid metric as dimensions/features increase?
Euclidean distance is perfectly scalable--works for two dimensions or two thousand. For any pair of data points:
subtract their feature vectors element-wise,
square each item in that result vector,
sum that result,
take the square root of that scalar.
Nowhere in this sequence of calculations is scale implicated.
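Those four steps, written out (equivalently, np.linalg.norm(a - b)):

    import numpy as np

    a = np.random.rand(30)   # works identically for 30 or 100 dimensions
    b = np.random.rand(30)

    d = np.sqrt(np.sum((a - b) ** 2))   # subtract, square, sum, square root
    assert np.isclose(d, np.linalg.norm(a - b))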
But whether Euclidean distance is the appropriate similarity metric for your problem depends on your data. For instance, is it purely numeric (continuous)? Or does it have discrete (categorical) variables as well (e.g., gender: M/F)? If one of your dimensions is "current location" and, of 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you can't really say that, on average, your users are from somewhere in Kansas, but that's sort of what Euclidean distance would do.
In any event, since we don't know anything about it, I'll just give you a simple flow diagram so that you can apply it to your data and identify an appropriate similarity metric.
To identify an appropriate similarity metric given your data:
Euclidean distance is good when the dimensions are comparable and on the same scale. If one dimension represents length and another the weight of an item, plain Euclidean distance should be replaced with a weighted one.
Project the data to 2D and plot it; this is a good way to see visually whether it works.
Or you can use a sanity check, such as finding the cluster centers and checking that the items in each cluster aren't too far away from their center.
Can't you just try sum |xi - yi| instead of (xi - yi)^2 in your code, and see if it makes much difference?
I can't have a graph which will give some idea about the correctness of my algorithm.
A couple of possibilities:
look at some points midway between 2 clusters in detail
vary k a bit, and see what happens (what is your k?)
use PCA to map 30d down to 2d; see the plots under calculating-the-percentage-of-variance-measure-for-k-means, also SO questions/tagged/pca
By the way, scipy.spatial.cKDTree can easily give you, say, the 3 nearest neighbors of each point, in p=2 (Euclidean) or p=1 (Manhattan, L1), to look at. It's fast up to ~ 20d, and with early cutoff works even in 128d.
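A short sketch of that suggestion (the data size and choice of k are just examples):

    import numpy as np
    from scipy.spatial import cKDTree

    X = np.random.rand(300, 30)     # placeholder data
    tree = cKDTree(X)

    # k=4 because each point's nearest neighbour is itself; p=2 is Euclidean, p=1 Manhattan.
    dist, idx = tree.query(X, k=4, p=2)
    neighbours = idx[:, 1:]         # drop the self-match in column 0
    print(neighbours[:3])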
Added: I like Cosine distance in high dimensions; see euclidean-distance-is-usually-not-good-for-sparse-data for why.
Euclidean distance is the intuitive and "normal" distance between continuous variables. It can be inappropriate if the data are too noisy or have a non-Gaussian distribution.
You might want to try the Manhattan (or cityblock) distance, which is robust to that (bear in mind that robustness always comes at a cost: a bit of the information is lost, in this case).
There are many further distance metrics for specific problems (for example the Bray-Curtis distance for count data). You might want to try some of the distances implemented in pdist from the Python module scipy.spatial.distance.
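For instance, a quick way to compare a few of those metrics on the same data (the data here are random placeholders):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    X = np.random.rand(300, 30)
    for metric in ("euclidean", "cityblock", "braycurtis"):
        D = squareform(pdist(X, metric=metric))   # full pairwise distance matrix
        print(metric, D.mean())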

Resources