Davies-Bouldin index and maximum ratio - machine-learning

Davies-Bouldin index validation is basically the ratio of within-cluster scatter to between-cluster distance. We compute that for every cluster and finally take the maximum. My question here is: why the maximum and not the minimum?
Thank you.

Consider the following scenario:
Three clusters. One is well separated from the others, two are conflated.
Let S_i be 0.5 for all of them.
For the conflated pair, M_ij is close to zero; for the well-separated pairs, the distance between the means is much larger. The resulting R_i is therefore large for the conflated clusters and small for the separated ones.
If you take the maximum, the index says "two clusters are mixed up, the result is thus bad - not all clusters are well separated". If you used the minimum, it would ignore this problem and say "well, at least it separated them from one of the other clusters".
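The scenario above can be checked numerically. A minimal sketch, where the means, the scatter value of 0.5, and the gap between the conflated clusters are made-up illustration numbers:

```python
import numpy as np

# Hypothetical scenario: three cluster means, two conflated, one well separated.
means = np.array([[0.0, 0.0], [0.3, 0.0], [10.0, 0.0]])
S = np.array([0.5, 0.5, 0.5])  # within-cluster scatter, same for all clusters

k = len(means)
D = np.zeros(k)      # per-cluster ratio using the maximum (Davies-Bouldin)
D_min = np.zeros(k)  # what taking the minimum would give instead
for i in range(k):
    R = [(S[i] + S[j]) / np.linalg.norm(means[i] - means[j])
         for j in range(k) if j != i]
    D[i] = max(R)    # Davies-Bouldin keeps the worst (largest) ratio
    D_min[i] = min(R)

print(D.mean())      # large: the index flags the two conflated clusters
print(D_min.mean())  # small: the minimum would hide the problem
```

The maximum keeps each cluster's worst-case overlap, so a single bad separation cannot be averaged away by a good one.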


HDBSCAN difference between parameters

I'm confused about the difference between the following parameters in HDBSCAN
min_cluster_size
min_samples
cluster_selection_epsilon
Correct me if I'm wrong.
For min_samples, if it is set to 7, then clusters formed need to have 7 or more points.
For cluster_selection_epsilon, if it is set to 0.5 meters, then any clusters that are more than 0.5 meters apart will not be merged into one, meaning that each cluster will only include points that are 0.5 meters apart or less.
How is that different from min_cluster_size?
They technically do two different things.
min_samples = the minimum number of neighbours a core point needs. The higher this is, the more points will be discarded as noise/outliers. This comes from the DBSCAN part of HDBSCAN.
min_cluster_size = the minimum size a final cluster can be. The higher this is, the bigger your clusters will be. This comes from the H (hierarchical) part of HDBSCAN.
Increasing min_samples will increase the size of the clusters, but it does so by discarding data as outliers using DBSCAN.
Increasing min_cluster_size while keeping min_samples small, by comparison, keeps those outliers but instead merges any smaller clusters with their most similar neighbour until all clusters are above min_cluster_size.
So:
If you want many highly specific clusters, use a small min_samples and a small min_cluster_size.
If you want more generalized clusters but still want to keep most detail, use a small min_samples and a large min_cluster_size
If you want very very general clusters and to discard a lot of noise in the clusters, use a large min_samples and a large min_cluster_size.
(It's not possible to use min_samples larger than min_cluster_size, afaik)
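The min_samples effect can be seen in isolation with plain DBSCAN from scikit-learn (the part of HDBSCAN the parameter comes from). A sketch, where eps, the min_samples values and the synthetic data are arbitrary illustrations:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=0)

noise_counts = {}
for ms in (3, 25):
    labels = DBSCAN(eps=0.5, min_samples=ms).fit_predict(X)
    noise_counts[ms] = int(np.sum(labels == -1))  # -1 marks noise/outliers

# A larger min_samples demands a denser neighbourhood around each core point,
# so more points end up labelled as noise.
print(noise_counts)
```

The same mechanism is what drives HDBSCAN's min_samples; min_cluster_size, by contrast, only prunes or merges clusters in the hierarchy and does not create noise by itself.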

how to calculate the area between two cars

I would like to know if from an image like the example it is possible to calculate the area between two consecutive cars:
Detect the two objects, calculate the distance between my camera and each of the two objects, and so deduce the area between them.
Any advice or references would be welcome, thanks.
https://i.stack.imgur.com/4IM6y.jpg
This is not a very scalable solution, as it would vary by country, but license plates (at least in the US) are always of a similar dimension. This could give you an almost perfect distance reference, and they are easy to detect. The remaining difficulty would be estimating the space taken up by the partial view of the near car, but this would significantly reduce the complexity of the problem. To get that remaining bit, I would likely try to identify a tire/hubcap and apply an offset to that, as I imagine that will get you pretty close (within 1-2 feet).
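The plate-as-reference idea reduces to the pinhole model. A back-of-envelope sketch: the focal length in pixels, the measured pixel widths, and even the plate width are assumed numbers here (US plates are roughly 12 in / 0.305 m wide):

```python
# Pinhole model: distance = focal_length_px * real_width / pixel_width.
def distance_m(focal_px: float, real_width_m: float, pixel_width: float) -> float:
    return focal_px * real_width_m / pixel_width

PLATE_W = 0.305          # metres, approximate US plate width (assumption)
f = 1000.0               # focal length in pixels, from camera calibration (assumption)

d_near = distance_m(f, PLATE_W, 150.0)  # near car's plate spans 150 px (made up)
d_far = distance_m(f, PLATE_W, 60.0)    # far car's plate spans 60 px (made up)

gap = d_far - d_near     # rough longitudinal gap between the two cars
print(round(d_near, 2), round(d_far, 2), round(gap, 2))
```

This only gives the camera-to-plate distances along the optical axis; the tire/hubcap offset mentioned above would still be needed to account for the near car's body extending past its plate.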

Evaluating the confidence of an image registration process

Background:
Assume there are two shots of the same scene from two different perspectives. Applying a registration algorithm to them results in a homography matrix that represents the relation between them. Warping one of them using this homography matrix will (theoretically) result in two identical images (if the non-shared area is ignored).
Since nothing is perfect, the two images may not be absolutely identical; we may find some differences between them, and these differences show up clearly when subtracting one from the other.
Example:
Furthermore, the lighting conditions may result in a huge difference when subtracting.
Problem:
I am looking for a metric with which I can evaluate the accuracy of the registration process. This metric should be:
Normalized: a 0->1 measurement that does not depend on the image type (natural scene, text, human...). For example, if two totally different registration processes on totally different pairs of photos have the same confidence, say 0.5, this means the registration was equally good (or bad), even if one pair consists of very detail-rich photos and the other is a white background with "Hello" written in black.
Distinguishing between misregistration and different lighting conditions: although there are many ways to eliminate this difference and make the two images look approximately the same, I am looking for a measurement that does not count it, rather than one that fixes it (for performance reasons).
One of the first things that came to mind is to sum the absolute differences of the two images. However, this results in a number that represents the error, and that number is meaningless when compared to another registration process: another pair of images with better registration but more detail may give a bigger error rather than a smaller one.
Sorry for the long post. I am glad to provide any further information and collaborating in finding the solution.
P.S. Using OpenCV is acceptable and preferable.
You can always use invariant (lighting/scale/rotation) features in both images, for example SIFT features.
When you match these using the typical ratio test (between the nearest and next-nearest match), you'll have a large set of matches. You can calculate the homography using your own method, or using RANSAC on these matches.
In any case, for any homography candidate, you can count the number of feature matches (out of all of them) that agree with the model.
That number divided by the total number of matches gives you a 0-1 metric for the quality of the model.
If you use RANSAC on the matches to calculate the homography, this quality metric is already built in.
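The inlier-ratio metric can be sketched in plain numpy: given matched point pairs and a homography candidate H, the fraction of matches whose reprojection error falls under a threshold is a normalized 0-1 score. The toy points and threshold below are made up:

```python
import numpy as np

def inlier_ratio(H, src, dst, thresh=3.0):
    """src, dst: (N, 2) matched points; H: 3x3 homography mapping src -> dst."""
    pts = np.hstack([src, np.ones((len(src), 1))])  # to homogeneous coordinates
    proj = pts @ H.T
    proj = proj[:, :2] / proj[:, 2:3]               # back to Cartesian
    err = np.linalg.norm(proj - dst, axis=1)        # reprojection error per match
    return float(np.mean(err < thresh))

# Toy check: a pure translation by (5, 0) with one deliberately bad match.
H = np.array([[1.0, 0, 5], [0, 1.0, 0], [0, 0, 1.0]])
src = np.array([[0.0, 0], [1, 1], [2, 3], [4, 4]])
dst = src + np.array([5.0, 0])
dst[-1] += 50                                       # outlier match
print(inlier_ratio(H, src, dst))                    # 3 of 4 agree -> 0.75
```

With OpenCV, `cv2.findHomography(src, dst, cv2.RANSAC)` returns an inlier mask from which the same ratio can be read directly.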
This problem amounts to: given two images, decide how misaligned they are.
That is exactly why we did the registration. The registration approach cannot itself answer how bad a job it did, because if it knew, it would have done better.
Only in the absolutely correct case do we know the result: 0.
You want a deterministic answer? Then add deterministic input:
for example, a red square at a known, fixed position, from which rotation, translation and scale can be measured. Under lab conditions this can be achieved.

clustering with limited maximum size

I want to cluster some data points, but the maximum number of points per cluster is limited, so there is a maximum size per cluster. Is there any clustering algorithm for that?
Also, can I define my own size function? For example, instead of considering the number of points in a cluster as its size, I want to sum a column over all the points in the cluster.
A quick and not optimal solution is splitting the data into 2 parts iteratively until the number of points in each part is under the limit.
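That iterative splitting can be sketched with 2-means bisection: any cluster over the cap is split in two until every cluster fits. KMeans here is scikit-learn's; the data and the cap of 10 are made-up illustrations:

```python
import numpy as np
from sklearn.cluster import KMeans

def cap_cluster_size(X, max_size, rng=0):
    """Recursively bisect oversized clusters until all fit under max_size."""
    clusters, queue = [], [np.arange(len(X))]
    while queue:
        idx = queue.pop()
        if len(idx) <= max_size:
            clusters.append(idx)       # small enough: keep as a final cluster
            continue
        halves = KMeans(n_clusters=2, n_init=10, random_state=rng).fit_predict(X[idx])
        queue.append(idx[halves == 0])
        queue.append(idx[halves == 1])
    return clusters

X = np.random.default_rng(0).normal(size=(50, 2))
clusters = cap_cluster_size(X, max_size=10)
print([len(c) for c in clusters])      # every cluster holds at most 10 points
```

A custom size function (e.g. summing a column instead of counting points) would only change the `len(idx) <= max_size` test, though as noted below the splits themselves would no longer be guided by it.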
The problem of k-means clustering with minimum size constraints is addressed in this paper:
Bradley, P. S., K. P. Bennett, and Ayhan Demiriz. "Constrained k-means clustering." Microsoft Research, Redmond (2000): 1-8.
However, the approach proposed in this paper can be easily extended to the maximum size constraints.
Here is an implementation of this algorithm and an extension to it which addresses both minimum size and maximum size constraints.
As for your question about a custom size function, that is a more difficult problem, for which I suspect local-search approaches are more appropriate.
As clustering will usually try to make the clusters as large as possible, this isn't really clustering anymore. It is more like a minimum spanning tree, where you remove the longest edges to find groups.
You could try something like x-means, i.e. a k-means variation where you split clusters that you consider to be too large.

How to compute histograms using weka

Given a dataset with 23 points spread out over 6 dimensions, in the first part of this exercise we should do the following, and I am stuck on the second half of this:
Compute the first step of the CLIQUE algorithm (detection of all dense cells). Use
three equal intervals per dimension in the domain 0..100, and consider a cell as dense if it contains at least five objects.
Now this is trivial and simply a matter of counting. The next part asks the following though:
Identify a way to compute the above CLIQUE result by only using the functions of
Weka provided in the tabs of Preprocess, Classify , Cluster , or Associate .
Hint : Just two tabs are needed.
I've been trying this for over an hour now, but I can't seem to get anywhere near a solution here. If anyone has a hint, or maybe a useful tutorial which gives me a little more insight into weka it would be very much appreciated!
I am assuming you have 23 instances (rows) and 6 attributes (dimensions)
Use three equal intervals per dimension
Use the Preprocess tab to discretize your data into 3 equal bins (the filter command line is below); you use 3 bins for the intervals. You may also switch useEqualFrequency between false and true and try again; I think true may give better results.
weka.filters.unsupervised.attribute.Discretize -B 3 -M -1.0 -R first-last
After that, cluster your data. This will show you which instances are near each other. Since you would like to find dense cells, I think SOM may be appropriate.
a cell as dense if it contains at least five objects.
You have 23 instances. Therefore try 2x2=4 cluster centers, then 2x3=6, 2x4=8 and 3x3=9. If your data points are close together, some of the cluster centers should always hold at least 5 instances, no matter how many cluster centers you choose.
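For checking the Weka result by hand, the dense-cell step of CLIQUE itself is easy to sketch in numpy: 3 equal bins per dimension on 0..100, a cell is dense if it holds at least 5 objects. CLIQUE works bottom-up, so 1-D dense cells come first; the data here is synthetic, not the exercise's:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(23, 6))      # 23 instances, 6 dimensions (synthetic)

# Bin index 0, 1 or 2 per value: edges at 100/3 and 200/3 split 0..100 in thirds.
bins = np.digitize(X, [100 / 3, 200 / 3])

# CLIQUE's first step finds dense units per single dimension, then combines them.
for d in range(X.shape[1]):
    counts = Counter(bins[:, d])
    dense = sorted(cell for cell, n in counts.items() if n >= 5)
    print(f"dim {d}: dense 1-D cells {dense}")
```

The Discretize filter in the Preprocess tab performs exactly this binning; the counting is what the second tab's output has to reproduce.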
