Clustering different classes - machine-learning

I would like to cluster my dataset, which has multiple classes (up to 10). But this clustering problem is different from the usual one: I need to cluster points from different classes together (as shown in the image: https://ibb.co/iiNbqv) instead of points from the same/similar class. Which method should I use? What would you recommend?
The problem is as follows:
I have several frames/images (up to 10), and each frame has hundreds of thousands of detections. The data I am processing is the location (x and y coordinates) of each detection. I want to find how many detections overlap across these frames within a certain distance threshold. The constraint is that a frame should not contribute more than one detection to the same overlap cluster, as seen in the picture. So basically, I should find the nearest detection to a point from each of the other frames and put them in the same cluster, but with the condition that no two points in the cluster are farther from each other than the distance threshold.
Cheers

Based on your image and the circles you drew, this seems to be a problem of clustering data points based on the distance between them. A simple Euclidean-distance-based clustering algorithm should provide the results you are looking for.
Something like this would cluster data points according to a measure of distance between them. The only parameter is the distance threshold, which you should set to one that is consistent with your problem.
# A sketch in Python (the original answer gave this as pseudo-code).
# p is a list of all detections; each detection has x, y, a frame id (its
# "class"), and a cluster attribute that starts as None.
import math

def dist(a, b):
    return math.hypot(a.x - b.x, a.y - b.y)

max_dist = 5.0   # the distance threshold; set it to suit your problem

clusters = []
for pi in p:
    if pi.cluster is not None:
        continue
    c = [pi]                     # create a new cluster seeded with this point
    pi.cluster = c
    for pn in p:
        if pn.cluster is not None:
            continue
        # constraint: at most one detection per frame/class in a cluster
        if any(q.frame == pn.frame for q in c):
            continue
        if dist(pi, pn) < max_dist:
            pn.cluster = c
            c.append(pn)
    clusters.append(c)
This basically goes to each point and finds which of them are near it, and assigns them to the same cluster. There are a lot of variations of this clustering routine that achieve different goals.
For example, you can compare the distance to the current centroid of the cluster rather than to the point that initialized the cluster, or compare with the last point added to the cluster rather than the first one. It depends on what works better for the nature of your data.

Clustering won't help you much here, because it is too exploratory.
You should instead look into optimization in general. In particular, your problem has similarities to the set cover problem. As far as I can tell, you want to cover all instances with sets of three, such that the three elements are in different "classes" and are as similar as possible?
Based on results from optimization theory, you will likely be able to show that this problem is NP-hard, and therefore a greedy approximation algorithm is the preferable way of handling it.
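A minimal greedy sketch of that idea, assuming detections are (frame_id, x, y) tuples and that groups of three are wanted; it brute-forces all cross-frame triples, so it is only for illustration (real data would need a spatial index to prune candidates):

import math
from itertools import combinations

def greedy_triples(points, max_dist):
    # points: list of (frame_id, x, y); max_dist: the distance threshold
    candidates = []
    for a, b, c in combinations(range(len(points)), 3):
        pa, pb, pc = points[a], points[b], points[c]
        if len({pa[0], pb[0], pc[0]}) < 3:
            continue                              # require three different frames
        d_ab = math.dist(pa[1:], pb[1:])
        d_ac = math.dist(pa[1:], pc[1:])
        d_bc = math.dist(pb[1:], pc[1:])
        if max(d_ab, d_ac, d_bc) > max_dist:
            continue                              # respect the distance threshold
        candidates.append((d_ab + d_ac + d_bc, a, b, c))
    candidates.sort()                             # tightest (most similar) triples first
    used, groups = set(), []
    for _, a, b, c in candidates:
        if used.isdisjoint({a, b, c}):            # greedily take each detection at most once
            groups.append((a, b, c))
            used.update((a, b, c))
    return groups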

Related

Clustering accuracy check with Confusion Matrix

I have an accident location dataset. I have applied several clustering algorithms to this dataset using the latitude and longitude columns. Now I would like to measure the accuracy of the different clustering algorithms separately in order to compare them.
I want to apply the confusion matrix described in this article.
But I am not able to understand what I should consider as a label. I have made my clusters using only two columns, latitude and longitude. Can anyone guide me, please? I have the code, but it's not clear to me what the label or class label would be in my case.
In a confusion matrix you provide two sets of labels for each entry. One of these labels is the cluster assignment generated by the clustering you did. The second label can be the ground truth, which allows you to determine accuracy/precision.
Your case sounds like there is no ground truth, so you can't compute accuracy. You CAN use the result of one of the other algorithms you used as the second set of labels, to compare the results of those two clusterings against each other.
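A small hedged illustration, assuming scikit-learn is available; labels_a and labels_b are made-up stand-ins for the assignments of two clustering runs on the same rows:

from sklearn.metrics import confusion_matrix, adjusted_rand_score

labels_a = [0, 0, 1, 1, 2, 2]   # e.g. k-means assignments (made-up values)
labels_b = [1, 1, 0, 0, 2, 2]   # e.g. DBSCAN assignments; cluster ids need not match

print(confusion_matrix(labels_a, labels_b))     # rows: clustering A, columns: clustering B
print(adjusted_rand_score(labels_a, labels_b))  # agreement score that ignores label ids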

Finding displacement between two camera frames

I'm currently working on a visual odometry project. I have implemented everything up to the Essential Matrix decomposition stage, but the resulting translation vector is normalized, so I cannot plot the movement.
Now how can I compute the displacement at some scale? I have seen suggestions to use planar homography to compute the absolute translation, but I don't understand how to apply it, since the outdoor environment is not simply planar. At least, by treating the ground as planar, how can its translation be obtained? I've seen a suggestion here. Is it possible to use this approach to get the displacement between two frames?
What you are referring to is called registration. This is a vast field. There are methods for linear transformations across the entire image, and per-pixel methods (the two ends of the spectrum). Naturally, per-pixel methods are typically far slower and have many local errors.
Typically two frames have very little transformation between them, and a simple homography will do to find the general scaling between them, especially if you are talking about aerial photos. If your case is very far from planar, then you may want to use something closer to pixel-wise registration, for example spline fitting: https://www.mathworks.com/matlabcentral/fileexchange/20057-b-spline-grid--image-and-point-based-registration
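A rough sketch of the homography route, assuming OpenCV; the frame file names are placeholders and ORB matching is just one possible way to get correspondences:

import cv2
import numpy as np

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print(H)   # 3x3 homography mapping frame1 points onto frame2 (meaningful for near-planar scenes)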
You cannot recover scale, generally speaking, unless you can recognize in the scene 1 or more objects of known physical size.

Clustering K-means algorithm for elongated data set

I have a question while programming the K-means algorithm in Matlab. Why is the K-means algorithm not suitable for clustering an elongated data set?
In short, draw some thick lines on a paper. Can you really represent each one with a single point? How would single points give information about orientation?
K-means assigns each data point to the nearest centroid. That is to say, for each centroid c, all points whose distance from c is smaller than their distance to every other centroid will be assigned to c. And since a (hyper)ball is, in fact, the set of all points within some distance of a center, it is easy to see why the resulting clusters tend to be spherical. (To be exact, k-means practically creates a Voronoi diagram in the vector space.)
Elongated clusters, however, don't necessarily satisfy the requirement that all their points are closer to their own "center of mass" than to some other cluster's center.
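A small illustration of this, assuming scikit-learn and numpy: two parallel elongated clusters that k-means tends to split across their length instead of separating them:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
a = np.column_stack([rng.uniform(0, 10, 200), rng.normal(0.0, 0.3, 200)])   # stripe at y ~ 0
b = np.column_stack([rng.uniform(0, 10, 200), rng.normal(3.0, 0.3, 200)])   # stripe at y ~ 3
X = np.vstack([a, b])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# If k-means had recovered the stripes, each stripe would get a single label;
# instead the centroids usually split the data by x, mixing the two stripes.
print(labels[:200].mean(), labels[200:].mean())   # values near 0.5 mean a stripe was split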
It is also difficult to choose initial cluster centers in an elongated data set, and the initialization has a strong effect on the result: you may get different results when you choose different starting points. With a compact, well-separated data set you will get essentially one result no matter which 3 initial points you choose, but with an elongated data set the outcome differs from run to run.

Calculating heat map weights based on clustering of points

I have an array of MKLocationCoordinate2D in iOS and I'd like to create a heat map of those points based on the clustering of them.
i.e. the more there are in a certain area, the higher the weight.
I've found a load of different frameworks for generating the heat maps, and they all require you to calculate the weights yourself (which makes sense).
I'm just not sure where to start with the calculation.
I could do something like calculating the mean distance between each point and every other point but I'm not sure if that's a good idea.
Could someone point me in the direction of how to weight each point based on its closeness to other points?
Thanks
I solved this by implementing a quad tree and using that to quickly get the number of neighbours within a certain radius.
I can then change the radius to tweak it but it will very quickly return weights based on how many neighbours each point has.
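The answer above used a quad tree on iOS; as an illustration only, the same neighbour-count weighting can be sketched in Python with a k-d tree (scipy's cKDTree), assuming the points are plain (x, y) coordinates:

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(1000, 2)     # stand-in for the coordinate array
tree = cKDTree(points)
radius = 0.05                        # tweakable, like the quad-tree radius

# weight of each point = number of neighbours within `radius` (excluding itself)
weights = np.array([len(tree.query_ball_point(p, radius)) - 1 for p in points])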

Selecting an appropriate similarity metric & assessing the validity of a k-means clustering model

I have implemented k-means clustering to determine the clusters in 300 objects. Each of my objects has about 30 dimensions. The distance is calculated using the Euclidean metric.
I need to know:
How would I determine if my algorithm works correctly? I can't produce a graph that would give some idea of the correctness of my algorithm.
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?
The two questions in the OP are separate topics (i.e., no overlap in the answers), so I'll try to answer them one at a time, starting with item 1 on the list.
How would I determine if my [clustering] algorithms works correctly?
k-means, like other unsupervised ML techniques, lacks a good selection of diagnostic tests to answer questions like "are the cluster assignments returned by k-means more meaningful for k=3 or k=5?"
Still, there is one widely accepted test that yields intuitive results and that is straightforward to apply. This diagnostic metric is just this ratio:
inter-centroidal separation / intra-cluster variance
As the value of this ratio increases, the quality of your clustering result increases.
This is intuitive. The first of these metrics just asks: how far apart is each cluster from the others (measured between the cluster centers)?
But inter-centroidal separation alone doesn't tell the whole story, because two clustering algorithms could return results having the same inter-centroidal separation though one is clearly better, because the clusters are "tighter" (i.e., smaller radii); in other words, the cluster edges have more separation. The second metric--intra-cluster variance--accounts for this. This is just the mean variance, calculated per cluster.
In sum, the ratio of inter-centroidal separation to intra-cluster variance is a quick, consistent, and reliable technique for comparing results from different clustering algorithms, or to compare the results from the same algorithm run under different variable parameters--e.g., number of iterations, choice of distance metric, number of centroids (value of k).
The desired result is tight (small) clusters, each one far away from the others.
The calculation is simple (a small code sketch follows the steps below):
For inter-centroidal separation:
calculate the pair-wise distance between cluster centers; then
calculate the median of those distances.
For intra-cluster variance:
for each cluster, calculate the distance of every data point in that cluster from its cluster center; next
(for each cluster) calculate the variance of the sequence of distances from the step above; then
average these variance values.
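A minimal sketch of that ratio, assuming numpy/scipy, with X the data matrix, labels the cluster assignments, and centers the centroid matrix (hypothetical variable names):

import numpy as np
from scipy.spatial.distance import pdist

def separation_to_variance(X, labels, centers):
    inter = np.median(pdist(centers))        # median pairwise distance between centroids
    intra = np.mean([
        np.linalg.norm(X[labels == k] - c, axis=1).var()   # variance of distances to own centroid
        for k, c in enumerate(centers)
    ])
    return inter / intra                     # larger is better: far-apart, tight clusters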
That's my answer to the first question. Here's the second question:
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30 ?
First, the easy question--is Euclidean distance a valid metric as dimensions/features increase?
Euclidean distance is perfectly scalable--works for two dimensions or two thousand. For any pair of data points:
subtract their feature vectors element-wise,
square each item in that result vector,
sum that result,
take the square root of that scalar.
Nowhere in this sequence of calculations is scale implicated.
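The same four steps in code, as a trivial numpy illustration (the vectors here are made up):

import numpy as np

a = np.array([1.0, 2.0, 3.0])      # any two feature vectors of equal length
b = np.array([4.0, 0.0, 3.0])
d = np.sqrt(np.sum((a - b) ** 2))  # equivalent to np.linalg.norm(a - b), in 3 or 3000 dimensions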
But whether Euclidean distance is the appropriate similarity metric for your problem depends on your data. For instance, is it purely numeric (continuous)? Or does it have discrete (categorical) variables as well (e.g., gender: M/F)? If one of your dimensions is "current location" and of the 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you can't really say that, on average, your users are from somewhere in Kansas, but that's sort of what Euclidean distance would do.
In any event, since we don't know anything about it, I'll just give you a simple flow diagram so that you can apply it to your data and identify an appropriate similarity metric.
To identify an appropriate similarity metric given your data:
Euclidean distance is good when dimensions are comparable and on the same scale. If one dimension represents length and another the weight of an item, Euclidean distance should be replaced with a weighted distance.
Make it 2d and show the picture - this is a good option to see visually whether it works.
Or you may use some sanity check - e.g., find the cluster centers and see that all items in each cluster aren't too far away from them.
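One common way to put dimensions on a comparable scale before using Euclidean distance is to z-score each column first; a small numpy sketch with made-up data:

import numpy as np

X = np.random.rand(300, 30) * np.arange(1, 31)   # columns on very different scales
Xz = (X - X.mean(axis=0)) / X.std(axis=0)        # z-score each dimension
# Euclidean distances on Xz no longer let the largest-scale column dominate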
Can't you just try sum |xi - yi| instead of (xi - yi)^2 in your code, and see if it makes much difference?
I can't have a graph which will give some idea about the correctness of my algorithm.
A couple of possibilities:
look at some points midway between 2 clusters in detail
vary k a bit, see what happens (what is your k ?)
use PCA to map 30d down to 2d; see the plots under calculating-the-percentage-of-variance-measure-for-k-means, and also SO questions/tagged/pca (a quick sketch of this follows below)
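A quick sketch of the PCA-to-2d suggestion, assuming scikit-learn and matplotlib; X and the k-means labels are stand-ins for your data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = np.random.rand(300, 30)                               # stand-in for the 300 x 30 data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

X2 = PCA(n_components=2).fit_transform(X)                 # map 30d down to 2d
plt.scatter(X2[:, 0], X2[:, 1], c=labels)
plt.show()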
By the way, scipy.spatial.cKDTree can easily give you, say, the 3 nearest neighbors of each point, in p=2 (Euclidean) or p=1 (Manhattan, L1), to look at. It's fast up to ~20d, and with early cutoff works even in 128d.
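For example, a small sketch of that, assuming X is an (n, 30) numpy array of your data:

import numpy as np
from scipy.spatial import cKDTree

X = np.random.rand(300, 30)
tree = cKDTree(X)
dists, idx = tree.query(X, k=4, p=2)   # each point plus its 3 nearest neighbours; use p=1 for Manhattan
neighbours = idx[:, 1:]                # drop column 0, which is the point itself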
Added: I like Cosine distance in high dimensions; see euclidean-distance-is-usually-not-good-for-sparse-data for why.
Euclidean distance is the intuitive and "normal" distance between continuous variables. It can be inappropriate if the data is too noisy or has a non-Gaussian distribution.
You might want to try the Manhattan distance (or cityblock distance), which is robust to that (bear in mind that robustness always comes at a cost: a bit of the information is lost, in this case).
There are many further distance metrics for specific problems (for example, the Bray-Curtis distance for count data). You might want to try some of the distances implemented in pdist from the Python module scipy.spatial.distance.
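For instance, a small sketch of trying a few of the metrics implemented in pdist, with X as a stand-in data matrix:

import numpy as np
from scipy.spatial.distance import pdist

X = np.random.rand(300, 30)
for metric in ("euclidean", "cityblock", "braycurtis", "cosine"):
    d = pdist(X, metric=metric)
    print(metric, round(float(d.mean()), 3))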
