I have an accident location dataset. I have applied several clustering algorithms to this dataset using the latitude and longitude columns. Now I would like to measure the accuracy of each clustering algorithm separately so that I can compare them.
I want to apply the confusion matrix described in this article.
But I am not able to understand what I should consider as a label? I have made my clusters using only two columns latitude and longitude. Can anyone guide me, please? I have the code but it's not clear to me. I mean what is the label or class label in my case?
In a confusion matrix you provide two sets of labels for each entry. One of these labels is the cluster assignment generated by the clustering you did. The second label can be the ground truth, which allows you to determine accuracy/precision.
Your case sounds like there is no ground truth, so you can't compare for accuracy. You CAN use the result of one of the different algorithms you used as the second set of labels, to compare the result between these two clusterings.
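As a rough illustration of comparing two clusterings without ground truth, the sketch below builds the table of label pairs (the "confusion matrix" between two clusterings) and the fraction of point pairs on which both clusterings agree (the Rand index). The label lists are made-up example data, not taken from the question, and the function names are hypothetical. scikit-learn ships `contingency_matrix` and `adjusted_rand_score` for the same purpose.

```python
from collections import Counter


def contingency_table(labels_a, labels_b):
    """Count how often each pair (cluster in A, cluster in B) occurs."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    return Counter(zip(labels_a, labels_b))


def pair_agreement(labels_a, labels_b):
    """Rand index: fraction of point pairs treated the same way by both
    clusterings (kept together in both, or separated in both)."""
    n = len(labels_a)
    agree = 0
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            agree += same_a == same_b
            total += 1
    return agree / total


# Hypothetical cluster assignments for six accident locations.
# Cluster ids are arbitrary names, so id 0 in one list need not match id 0 in the other.
kmeans_labels = [0, 0, 0, 1, 1, 1]
dbscan_labels = [1, 1, 0, 0, 0, 0]

table = contingency_table(kmeans_labels, dbscan_labels)
rand = pair_agreement(kmeans_labels, dbscan_labels)
print(dict(table))
print(round(rand, 3))
```

Because cluster ids are arbitrary, a pair-based score such as this one is usually easier to interpret than reading "accuracy" off the diagonal of the table.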
When using KNeighborsClassifier what is the motivation of using weights="distance" ?
According to sklearn docs:
‘distance’ : weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
What is the motivation of using this?
The idea of a nearest-neighbors classifier is to consider those points of the training set which are close to the point you want to classify, and guess this point's class based on their known class labels.
If all these close training points have the same label, the result is clear. But what if they don't all have the same label? You could take their most common label, but this may not always be the best guess.
For example, imagine one training point with label A being very close to the point you want to classify, while two training points with label B are somewhat further away, but still close. Should the new point be labelled A or B? Weighting the points by how close they are (i.e. by the inverse of their distance) provides an objective way to answer this question.
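The A-versus-B example can be worked through with a small hand-written vote. This is only a sketch of the idea, not scikit-learn's implementation; the function `knn_predict` and the coordinates are made up for illustration.

```python
import math
from collections import defaultdict


def knn_predict(train, query, k, weights="uniform"):
    """Classify `query` from its k nearest training points.

    train: list of ((x, y), label) pairs.
    weights: "uniform" counts each neighbour once,
             "distance" weights each neighbour by 1 / distance.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        d = math.dist(point, query)
        if weights == "distance":
            # Choice made for this sketch: an exact match dominates the vote.
            votes[label] += 1.0 / d if d > 0 else math.inf
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)


# One A point very close to the query, two B points somewhat further away.
train = [((0.1, 0.0), "A"), ((1.0, 0.0), "B"), ((0.0, 1.0), "B")]
query = (0.0, 0.0)

uniform_guess = knn_predict(train, query, k=3)
weighted_guess = knn_predict(train, query, k=3, weights="distance")
print(uniform_guess, weighted_guess)
```

With uniform weights the two B points outvote the single A point. With inverse-distance weights the A point contributes 1/0.1 = 10 against 1 + 1 = 2 for B, so the prediction flips to A.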
Why is it wrong to think that silhouette_score only needs the data, since it "outputs a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation)"?
However, I also need to input the labels (which the function itself computes); so, why are the labels necessary to input?
how similar an object is to its own cluster
In order to compute the silhouette, you need to know to which cluster your samples belong.
Also:
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of.
You need the labels to know what "intra-cluster" and "nearest-cluster" mean.
Silhouette_score is a metric for clustering quality, not a clustering algorithm. It considers both the inter-class and intra-class distance.
For that calculation to happen, you need to supply both the data and target labels (estimated by unsupervised methods like K-means).
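To make the role of the labels concrete, here is a minimal hand-written version of the (b - a) / max(a, b) formula quoted above. It is a sketch for illustration, not scikit-learn's code; the data and the two labelings are invented. The same four points get very different scores depending on which labels are passed in, which is why the data alone is not enough.

```python
import math


def silhouette_samples(points, labels):
    """Silhouette value (b - a) / max(a, b) for every sample."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the sample's own cluster
        # (the sample itself contributes distance 0 to the sum)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
sensible_labels = [0, 0, 1, 1]   # groups the left pair and the right pair
poor_labels = [0, 1, 0, 1]       # mixes left and right points

sensible = sum(silhouette_samples(points, sensible_labels)) / len(points)
poor = sum(silhouette_samples(points, poor_labels)) / len(points)
print(sensible > poor)
```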
Is there any way to reduce the dimension of the following features from 2D coordinate (x,y) to one dimension?
Yes. In fact, there are infinitely many ways to reduce the dimension of the features. It's by no means clear, however, how they perform in practice.
Feature reduction is usually done via a principal component analysis (PCA), which involves a singular value decomposition. It finds the directions with the highest variance, that is, those directions in which "something is going on".
In your case, a PCA might find the black line as one of the two principal components:
The projection of your data onto this one-dimensional subspace then yields the reduced form of your data.
Even by eye one can see that the three feature sets can be separated on this line; I coloured the three ranges accordingly. For your example, it is even possible to separate the data sets completely. A new data point would then be classified according to the range in which its projection onto the black line (or, more generally, onto the principal component subspace) lies.
Formally, one could obtain a division with further methods that use the PCA-reduced data as input, such as for example clustering methods or a K-nearest neighbour model.
So, yes, in case of your example it could be possible to make such a strong reduction from 2D to 1D, and, at the same time, even obtain a reasonable model.
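A minimal sketch of such a 2D-to-1D reduction, assuming made-up points that lie in three groups roughly along one line (the original figure is not available, so the data here is invented). For two dimensions the leading principal direction has a closed form, so no linear algebra library is needed; in practice one would more likely use a library routine such as scikit-learn's `PCA`.

```python
import math


def first_principal_axis(points):
    """Return (mean, unit direction) of the highest-variance direction of 2D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed form for the leading eigenvector of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))


def project(points, mean, direction):
    """1D coordinate of each point along the principal axis."""
    return [
        (x - mean[0]) * direction[0] + (y - mean[1]) * direction[1]
        for x, y in points
    ]


# Hypothetical data: three groups of two points each, spread along y = x.
points = [(0.0, 0.2), (0.5, 0.4), (3.0, 3.1), (3.4, 3.3), (6.0, 6.2), (6.5, 6.4)]
mean, direction = first_principal_axis(points)
reduced = project(points, mean, direction)
print([round(v, 2) for v in reduced])
```

The three groups occupy disjoint ranges of the single reduced coordinate, so simple thresholds, or any 1D clustering or nearest-neighbour model, can separate them.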
In image processing, how do region growing and clustering differ from each other? Please give more information on how they differ. Thank you for reading.
Region growing :
You have to select seed points and then the local area around the seed is analyzed in order to know if the neighbor pixels should have the same label. http://en.wikipedia.org/wiki/Region_growing
It can be used for precise image segmentation.
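A minimal sketch of seeded region growing, assuming a fixed grey-value tolerance relative to the seed and 4-connectivity (both are choices made for this example; real implementations vary). The tiny image is made up.

```python
from collections import deque


def region_grow(image, seed, tolerance):
    """Grow a region from `seed` (row, col).

    A 4-connected neighbour joins the region when its grey value differs
    from the seed's grey value by at most `tolerance`.
    """
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (
                0 <= nr < rows
                and 0 <= nc < cols
                and (nr, nc) not in region
                and abs(image[nr][nc] - seed_value) <= tolerance
            ):
                region.add((nr, nc))
                frontier.append((nr, nc))
    return region


image = [
    [10, 12, 90, 11],
    [11, 13, 95, 10],
    [12, 92, 91, 12],
]
region = region_grow(image, seed=(0, 0), tolerance=5)
print(sorted(region))
```

The dark pixels in the right-hand column have grey values similar to the seed but are cut off by the bright column, so they are not added. A clustering that looks only at grey values would put them in the same group, which is one practical difference between the two approaches.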
Clustering :
There are many clustering techniques (k-means, hierarchical clustering, density clustering, etc.). Clustering algorithms don't require seed points as input because they are based on unsupervised learning.
It can be used for coarse image segmentation.
I found region growing similar to some clustering algorithm. I explained my view point below:
In region growing there are 2 cases:
Selecting seed points randomly, which is similar to k-means: the seeds play the role of the means. We start with one seed and grow it until it cannot grow any further (just as we start with one mean and continue until convergence). The region is usually grown based on the Euclidean distance from the seed's grey value.
The second case is region growing with no seed (assume we don't know how many seeds to choose, or we don't know the number of clusters). We start with the first pixel, then find the neighbors of the current pixel that lie within distance d of the mean grey value of the region (at the first iteration, the mean grey value is simply the current pixel's grey value). Afterwards, we update the mean grey value. In this way region growing seems to act like the mean shift algorithm. If we don't update the mean grey value after each assignment, it could be considered a DBSCAN algorithm.
How exactly is a U-matrix constructed in order to visualise a self-organizing map? More specifically, suppose that I have an output grid of 3x3 nodes (that have already been trained), how do I construct a U-matrix from this? You can e.g. assume that the neurons (and inputs) have dimension 4.
I have found several resources on the web, but they are not clear or they are contradictory. For example, the original paper is full of typos.
A U-matrix is a visual representation of the distances between neurons in the input data dimension space. Namely you calculate the distance between adjacent neurons, using their trained vector. If your input dimension was 4, then each neuron in the trained map also corresponds to a 4-dimensional vector. Let's say you have a 3x3 hexagonal map.
The U-matrix will be a 5x5 matrix with interpolated elements for each connection between two neurons like this
The {x,y} elements are the distance between neuron x and y, and the values in {x} elements are the mean of the surrounding values. For example, {4,5} = distance(4,5) and {4} = mean({1,4}, {2,4}, {4,5}, {4,7}). For the calculation of the distance you use the trained 4-dimensional vector of each neuron and the distance formula that you used for the training of the map (usually Euclidean distance). So, the values of the U-matrix are only numbers (not vectors). Then you can assign a light gray colour to the largest of these values, a dark gray to the smallest, and corresponding shades of gray to the other values. You can use these colours to paint the cells of the U-matrix and obtain a visual representation of the distances between neurons.
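The construction can be sketched in a few lines. Note the simplifications: this sketch assumes a rectangular grid with 4 neighbours per neuron rather than the hexagonal map described above, and it fills the leftover diagonal cells with the mean of the adjacent link cells, which is just one possible convention. The function name `u_matrix` and the weight vectors are made up for illustration.

```python
import math


def u_matrix(weights):
    """Build a (2R-1) x (2C-1) U-matrix for an R x C rectangular SOM.

    weights[r][c] is the trained weight vector of the neuron at (r, c).
    Cells between two adjacent neurons hold their Euclidean distance;
    all other cells hold the mean of the adjacent distance cells.
    """
    rows, cols = len(weights), len(weights[0])
    height, width = 2 * rows - 1, 2 * cols - 1
    U = [[0.0] * width for _ in range(height)]

    # Link cells: distance between horizontally / vertically adjacent neurons.
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                U[2 * r][2 * c + 1] = math.dist(weights[r][c], weights[r][c + 1])
            if r + 1 < rows:
                U[2 * r + 1][2 * c] = math.dist(weights[r][c], weights[r + 1][c])

    # Neuron cells (even, even) and diagonal cells (odd, odd):
    # mean of the link cells directly above, below, left and right.
    for i in range(height):
        for j in range(width):
            if (i + j) % 2 == 0:
                around = [
                    U[a][b]
                    for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= a < height and 0 <= b < width
                ]
                if around:
                    U[i][j] = sum(around) / len(around)
    return U


# Hypothetical trained 3x3 map with 4-dimensional weight vectors:
# the two left columns are identical, the right column is far away.
near = (0.0, 0.0, 0.0, 0.0)
far = (1.0, 1.0, 1.0, 1.0)
weights = [[near, near, far] for _ in range(3)]

U = u_matrix(weights)
for row in U:
    print([round(v, 2) for v in row])
```

Mapping the largest value to light gray and the smallest to dark gray then shows a bright vertical band between the second and third neuron columns, which is the boundary in this toy map.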
Also have a look at this web article.
The original paper cited in the question states:
A naive application of Kohonen's algorithm, although preserving the topology of the input data is not able to show clusters inherent in the input data.
Firstly, that's true; secondly, it is a deep misunderstanding of the SOM; thirdly, it is also a misunderstanding of the purpose of calculating the SOM.
Just take the RGB color space as an example: are there 3 colors (RGB), or 6 (RGBCMY), or 8 (+BW), or more? How would you define that independent of the purpose, ie inherent in the data itself?
My recommendation would be not to use maximum likelihood estimators of cluster boundaries at all, not even such primitive ones as the U-matrix, because the underlying argument is already flawed. No matter which method you then use to determine the clusters, you would inherit that flaw. More precisely, the determination of cluster boundaries is not interesting at all, and it loses information regarding the true intention of building a SOM. So, why do we build SOMs from data?
Let us start with some basics:
Any SOM is a representative model of a data space, because it reduces the dimensionality of the latter. Because it is a model, it can be used as a diagnostic as well as a predictive tool. Yet neither use is justified by some universal objectivity. Instead, models are deeply dependent on the purpose and the accepted associated risk of errors.
Let us assume for a moment the U-Matrix (or similar) would be reasonable. So we determine some clusters on the map. It is not only an issue how to justify the criterion for it (outside of the purpose itself), it is also problematic because any further calculation destroys some information (it is a model about a model).
The only interesting thing on a SOM is the accuracy itself viz the classification error, not some estimation of it. Thus, the estimation of the model in terms of validation and robustness is the only thing that is interesting.
Any prediction has a purpose and the acceptance of the prediction is a function of the accuracy, which in turn can be expressed by the classification error. Note that the classification error can be determined for 2-class models as well as for multi-class models. If you don't have a purpose, you should not do anything with your data.
Inversely, the concept of "number of clusters" is completely dependent on the criterion "allowed divergence within clusters", so it is masking the most important thing of the structure of the data. It is also dependent on the risk and the risk structure (in terms of type I/II errors) you are willing to take.
So, how could we determine the number of classes on a SOM? If there is no external a priori reasoning available, the only feasible way would be an a posteriori check of the goodness of fit. On a given SOM, impose different numbers of classes and measure the deviations in terms of misclassification cost, then choose (subjectively) the most pleasing one (using some fancy heuristics, like Occam's razor).
Taken together, the U-matrix is pretending objectivity where no objectivity can be. It is a serious misunderstanding of modeling altogether.
IMHO it is one of the greatest advantages of the SOM that all the parameters implied by it are accessible and open for being parameterized. Approaches like the U-matrix destroy just that, by disregarding this transparency and closing it again with opaque statistical reasoning.