Is there any domain (/ dedicated keyword) of graph theory that covers graphs where the edges represent forces?
Force is a vector, so it has two attributes: a weight and a direction.
The weight represents the magnitude of the force.
The direction represents the direction in which the force acts. This is different from directed graphs, where only the head and tail nodes matter.
This sense of direction can be better understood through the following examples:
Example 1:
Consider a network of inelastic strings under tension, and suppose the network is in equilibrium. If we pull one node, all the other nodes will be pulled as well. Note that the lengths of the strings (~ weights) won't change, but the locations of the nodes, and thereby the directions of the strings, may change to bring all the nodes back to equilibrium after the pull.
Example 2: Consider all the planets (~ nodes) in the universe as a graph. All of them exert gravitational forces (~ edges) on each other and are in equilibrium. If we dislodge a planet/sun (or increase its size), the others are likely to be disturbed.
The edge weight/length can represent the magnitude of the force (but what about the direction?).
In both examples, the direction component distinguishes these edges from edge weights in the traditional sense, where edges carry only scalars; scalars do not have a direction.
Such scalars can be analogous to a sense of distance (shortest distance, eccentricity, closeness centrality) or of flow (betweenness centrality, etc.), but not force.
The question is: how can the direction of edges (in addition to their length/weight) be incorporated into network analysis? Is there any domain that focuses on graphs whose edges have weights as well as directions?
Note: The direction of an edge can be given as an additional parameter, such as an angle, or be specified by the locations of the connecting nodes.
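For concreteness, here is a minimal Python/networkx sketch of the kind of representation I mean (the node names, positions, and helper function are only illustrative): node coordinates are stored as node attributes, and each edge's force vector is then derived from its weight (magnitude) and the positions of its endpoints.

    import math
    import networkx as nx

    # Illustrative graph: positions live on the nodes, magnitude on the edge.
    G = nx.Graph()
    G.add_node("A", pos=(0.0, 0.0))
    G.add_node("B", pos=(3.0, 4.0))
    G.add_edge("A", "B", weight=10.0)   # weight ~ magnitude of the force

    def edge_force_vector(G, u, v):
        """Force on u along edge (u, v), as an (fx, fy) vector."""
        (x1, y1), (x2, y2) = G.nodes[u]["pos"], G.nodes[v]["pos"]
        dx, dy = x2 - x1, y2 - y1
        length = math.hypot(dx, dy)
        magnitude = G[u][v]["weight"]
        return (magnitude * dx / length, magnitude * dy / length)

    print(edge_force_vector(G, "A", "B"))   # (6.0, 8.0): magnitude 10 along direction (3, 4)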
What you're describing sounds like force-directed graph drawing algorithms as discussed here. Since you tagged this with networkx, the spring_layout method uses the Fruchterman-Reingold force-directed algorithm.
The networkx documentation doesn't list an actual reference to the algorithm, but the R igraph package lists this as the reference for their layout_with_fr function:
Fruchterman, T.M.J. and Reingold, E.M. (1991). Graph Drawing by Force-directed Placement. Software - Practice and Experience, 21(11):1129-1164.
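For example, a minimal networkx sketch (the graph and the parameter values below are arbitrary placeholders) that computes a Fruchterman-Reingold layout; spring_layout uses the 'weight' edge attribute, if present, as the spring strength:

    import networkx as nx

    # Placeholder graph; spring_layout implements the Fruchterman-Reingold
    # force-directed algorithm.
    G = nx.random_geometric_graph(30, 0.3, seed=42)
    pos = nx.spring_layout(G, k=0.5, iterations=100, seed=42)  # dict: node -> (x, y)

    # The positions can then be used for drawing, e.g.:
    # nx.draw(G, pos)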
I have a question for you.
I have performed a single and stereo calibration using 10 different checkerboard poses. I have acquired an image pair, obtained the 3D position of each pixel, and saved it in a point cloud (PCL).
After that, I have performed another calibration using 60 different checkerboard poses. The obtained calibration parameters are different from those estimated in the previous calibration.
I have used the same image pair to obtain the point cloud and get the 3D reconstruction of the scene, and I notice that the corresponding 3D points in the two point clouds have different locations in space.
When the two point clouds are displayed in MeshLab, they appear as two separate point clouds in space.
I think that the origin of the "reconstructed space" somehow changes according to the calibration parameters.
How can I get the transformation between the two different coordinate systems so that, knowing the transformation, I can display the second and the first point clouds overlapping?
The aim is to find this relationship using only the stereo-calibration parameters. I know that the transformation could be computed using correspondences between the same points in the two point clouds, but I need to find this relationship using only the calibration parameters.
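(For reference, the correspondence-based approach mentioned above, which I would like to avoid, might look roughly like this numpy sketch of a least-squares rigid alignment; the function and array names are placeholders.)

    import numpy as np

    def rigid_transform(P, Q):
        """Estimate R, t such that R @ P[i] + t ~= Q[i] for corresponding 3D points
        (Kabsch / least-squares rigid alignment). P, Q are (N, 3) arrays."""
        cP, cQ = P.mean(axis=0), Q.mean(axis=0)
        H = (P - cP).T @ (Q - cQ)            # 3x3 cross-covariance of centered points
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
        R = Vt.T @ D @ U.T
        t = cQ - R @ cP
        return R, t

    # Hypothetical usage with corresponding points from the two reconstructions:
    # R, t = rigid_transform(cloud_from_10_poses, cloud_from_60_poses)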
Thanks!
I have a question that came up while programming the K-means algorithm in Matlab: why is the K-means algorithm not suitable for classifying elongated data sets?
In short, draw some thick lines on a piece of paper. Can you really represent each one with a single point? How would single points give information about orientation?
K-means assigns each data point to its nearest centroid. That is to say, for each centroid c, all points whose distance from c is smaller (in comparison to all other centroids) will be assigned to c. And since a (hyper)sphere is, in fact, the set of all points whose distance from a center is less than or equal to some value, it is easy to see why the resulting clusters tend to be spherical. (To be exact, k-means practically creates a Voronoi diagram in the vector space.)
Elongated clusters, however, don't necessarily satisfy the requirement that all of their points are closer to their "center of mass" than to some other cluster's center.
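The question mentions Matlab, but here is a small Python sketch (the data and parameters are made up for illustration) that generates two elongated, parallel clusters and runs k-means on them; k-means typically cuts across the stripes rather than recovering them, which is the point above.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Two elongated ("stripe"-shaped) clusters: long in x, thin in y, close together.
    stripe1 = np.column_stack([rng.uniform(0, 20, 200), rng.normal(0.0, 0.2, 200)])
    stripe2 = np.column_stack([rng.uniform(0, 20, 200), rng.normal(1.5, 0.2, 200)])
    X = np.vstack([stripe1, stripe2])
    true_labels = np.repeat([0, 1], 200)

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # With stripes 20 units long but only 1.5 units apart, k-means usually splits
    # the data left/right rather than top/bottom, so agreement with the true
    # stripes is poor (close to 50%).
    agreement = max(np.mean(labels == true_labels), np.mean(labels != true_labels))
    print(f"agreement with true stripes: {agreement:.2f}")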
It is difficult to choose initial cluster centers in an elongated data set, but the choice has a powerful effect on the result. You may get different results when you choose different initial points.
In this case you will get only one result when you choose 3 initial points:
But it is different for an elongated data set.
I have implemented k-means clustering to determine the clusters in 300 objects. Each of my objects has about 30 dimensions. The distance is calculated using the Euclidean metric.
I need to know:
How would I determine if my algorithm works correctly? I can't have a graph that will give some idea about the correctness of my algorithm.
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?
The two questions in the OP are separate topics (i.e., no overlap in the answers), so I'll try to answer them one at a time, starting with item 1 on the list.
How would I determine if my [clustering] algorithms works correctly?
k-means, like other unsupervised ML techniques, lacks a good selection of diagnostic tests to answer questions like "are the cluster assignments returned by k-means more meaningful for k=3 or k=5?"
Still, there is one widely accepted test that yields intuitive results and that is straightforward to apply. This diagnostic metric is just this ratio:
inter-centroidal separation / intra-cluster variance
As the value of this ratio increases, the quality of your clustering result increases.
This is intuitive. The first of these metrics just asks: how far apart is each cluster from the others (measured according to the cluster centers)?
But inter-centroidal separation alone doesn't tell the whole story, because two clustering algorithms could return results having the same inter-centroidal separation though one is clearly better, because the clusters are "tighter" (i.e., smaller radii); in other words, the cluster edges have more separation. The second metric--intra-cluster variance--accounts for this. This is just the mean variance, calculated per cluster.
In sum, the ratio of inter-centroidal separation to intra-cluster variance is a quick, consistent, and reliable technique for comparing results from different clustering algorithms, or to compare the results from the same algorithm run under different variable parameters--e.g., number of iterations, choice of distance metric, number of centroids (value of k).
The desired result is tight (small) clusters, each one far away from the others.
The calculation is simple:
For inter-centroidal separation:
calculate the pair-wise distance between cluster centers; then
calculate the median of those distances.
For intra-cluster variance:
for each cluster, calculate the distance of every data point in that cluster from its cluster center; next
(for each cluster) calculate the variance of the sequence of distances from the step above; then
average these variance values.
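A rough numpy sketch of that calculation might look like this (X, labels, and centers are placeholder names for your data matrix, the k-means assignments, and the cluster centers):

    import numpy as np
    from scipy.spatial.distance import pdist

    def clustering_quality(X, labels, centers):
        """Ratio of inter-centroidal separation to intra-cluster variance.
        Higher is better: tight clusters that are far apart."""
        labels = np.asarray(labels)

        # Inter-centroidal separation: median of pairwise distances between centers.
        separation = np.median(pdist(centers))

        # Intra-cluster variance: for each cluster, the variance of the distances
        # of its points to its center, then averaged over clusters.
        variances = []
        for k, center in enumerate(centers):
            d = np.linalg.norm(X[labels == k] - center, axis=1)
            variances.append(d.var())
        return separation / np.mean(variances)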
That's my answer to the first question. Here's the second question:
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30 ?
First, the easy question--is Euclidean distance a valid metric as dimensions/features increase?
Euclidean distance is perfectly scalable--works for two dimensions or two thousand. For any pair of data points:
subtract their feature vectors element-wise,
square each item in that result vector,
sum that result,
take the square root of that scalar.
Nowhere in this sequence of calculations is scale implicated.
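In code (plain numpy, any number of dimensions), those steps are just:

    import numpy as np

    def euclidean(a, b):
        """Euclidean distance between two feature vectors of any length."""
        diff = np.asarray(a) - np.asarray(b)   # subtract element-wise
        return np.sqrt(np.sum(diff ** 2))      # square, sum, square root

    print(euclidean([0, 0, 0], [1, 2, 2]))     # 3.0; works the same for 30 or 100 dims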
But whether Euclidean distance is the appropriate similarity metric for your problem depends on your data. For instance, is it purely numeric (continuous)? Or does it have discrete (categorical) variables as well (e.g., gender: M/F)? If one of your dimensions is "current location" and, of the 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you can't really say that, on average, your users are from somewhere in Kansas, but that's sort of what Euclidean distance would do.
In any event, since we don't know anything about it, I'll just give you a simple flow diagram so that you can apply it to your data and identify an appropriate similarity metric.
To identify an appropriate similarity metric given your data:
Euclidean distance is good when the dimensions are comparable and on the same scale. If one dimension represents length and another the weight of an item, plain Euclidean distance should be replaced with a weighted distance (a small sketch follows these points).
Project the data to 2D and plot the picture; this is a good way to see visually whether it works.
Or you may use some sanity check, for example finding the cluster centers and verifying that all items in a cluster aren't too far away from it.
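A minimal sketch of what "weighted" could mean here (the data and per-dimension scales below are made up; one common choice is to rescale each dimension before taking the distance):

    import numpy as np

    # Hypothetical feature vectors: [length in metres, weight in grams].
    a = np.array([1.2, 500.0])
    b = np.array([1.5, 800.0])

    # Plain Euclidean distance is dominated by the grams axis.
    plain = np.sqrt(np.sum((a - b) ** 2))

    # One option: divide each dimension by its (assumed) typical spread first,
    # which is equivalent to a weighted Euclidean distance.
    scale = np.array([0.5, 300.0])
    weighted = np.sqrt(np.sum(((a - b) / scale) ** 2))

    print(plain, weighted)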
Can't you just try sum |xi - yi| instead of (xi - yi)^2 in your code, and see if it makes much difference?
I can't have a graph which will give some idea about the correctness of my algorithm.
A couple of possibilities:
look at some points midway between 2 clusters in detail
vary k a bit, see what happens (what is your k ?)
use PCA to map 30d down to 2d; see the plots under calculating-the-percentage-of-variance-measure-for-k-means, also SO questions/tagged/pca
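A small sketch of the PCA suggestion (the random matrix below is only a placeholder for your 300 x 30 data, and the k and labels are illustrative):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    X = np.random.rand(300, 30)                        # placeholder for your data
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

    # Map the 30-dimensional points down to 2D and colour them by cluster label,
    # to eyeball whether the clusters look separated at all.
    X2 = PCA(n_components=2).fit_transform(X)
    plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=10)
    plt.show()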
By the way, scipy.spatial.cKDTree can easily give you, say, the 3 nearest neighbors of each point, in p=2 (Euclidean) or p=1 (Manhattan, L1), to look at. It's fast up to ~ 20d, and with early cutoff works even in 128d.
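For instance (random placeholder data standing in for your 300 x 30 matrix):

    import numpy as np
    from scipy.spatial import cKDTree

    X = np.random.rand(300, 30)          # placeholder for your data

    tree = cKDTree(X)
    # k=4 because the nearest neighbour of each point is the point itself;
    # p=2 gives Euclidean distance, p=1 Manhattan (L1).
    dist, idx = tree.query(X, k=4, p=2)
    print(dist[:5, 1:])                  # distances to the 3 nearest neighbours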
Added: I like Cosine distance in high dimensions; see euclidean-distance-is-usually-not-good-for-sparse-data for why.
Euclidean distance is the intuitive and "normal" distance between continuous variables. It can be inappropriate if the data are too noisy or have a non-Gaussian distribution.
You might want to try the Manhattan distance (or cityblock distance), which is robust to that (bear in mind that robustness always comes at a cost: a bit of the information is lost, in this case).
There are many further distance metrics for specific problems (for example, the Bray-Curtis distance for count data). You might want to try some of the distances implemented in pdist from the Python module scipy.spatial.distance.
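For example, a quick sketch with made-up data:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    X = np.random.rand(300, 30)                             # placeholder data

    d_euclid = squareform(pdist(X, metric='euclidean'))
    d_city   = squareform(pdist(X, metric='cityblock'))     # Manhattan
    d_bray   = squareform(pdist(X, metric='braycurtis'))    # e.g. for count data

    print(d_euclid.shape, d_city[0, 1], d_bray[0, 1])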
There are n points in the 2D plane. A robot wants to visit all of them but can only move horizontally or vertically. How should it visit all of them so that the total distance it covers is minimal?
This is the Travelling Salesman Problem where the distance between each pair of points is |y2-y1|+|x2-x1| (called Rectilinear distance or Manhattan distance). It's NP-hard which basically means that there is no known efficient solution.
Methods to solve it on Wikipedia.
The simplest algorithm is a naive brute force search, where you calculate the distance for every possible permutation of the points and find the minimum. This has a running time of O(n!). This will work for up to about 10 points, but it will very quickly become too slow for larger numbers of points.
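A naive brute-force sketch in Python (only usable for very small n, as noted above; the sample points are arbitrary):

    from itertools import permutations

    def manhattan(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])

    def shortest_tour(points):
        """Try every visiting order and keep the cheapest one. O(n!) time."""
        best_order, best_cost = None, float("inf")
        for order in permutations(points):
            cost = sum(manhattan(order[i], order[i + 1]) for i in range(len(order) - 1))
            if cost < best_cost:
                best_order, best_cost = order, cost
        return best_order, best_cost

    print(shortest_tour([(0, 0), (2, 3), (5, 1), (1, 4)]))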