Euclidean Distance - machine-learning

I have some trouble understanding the Euclidean distance. I have two different entities and I want to measure the similarity between them.
Let's suppose that entity A has 2 feature vectors and entity B has only 1 feature vector. How am I supposed to calculate the Euclidean distance between these two entities in order to measure their similarity?
Thanks a lot.

You can calculate the Euclidean distance only for vectors of the same dimension. But you could define default values for the features that are missing in entity B.

The L2 distance is defined between two feature vectors. Here are two natural ways of applying it here (see the sketch below):
You could take the minimum L2 distance over all pairs formed by a feature vector of entity 1 and a feature vector of entity 2. If entity 1 has two vectors A=[1,3,2,1] and B=[3,2,4,1], and entity 2 has one vector C=[1,2,4,2], then dist = min(d([1,3,2,1],[1,2,4,2]), d([3,2,4,1],[1,2,4,2])).
You could average all the vectors of entity 1 and all the vectors of entity 2, and then compute the L2 distance between the two average vectors. With the same A, B and C as above, dist = d([(1+3)/2,(3+2)/2,(2+4)/2,(1+1)/2], [1,2,4,2]).
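A minimal NumPy sketch of both options (the array values are taken from the example above; the helper name l2 is just for illustration):

    import numpy as np

    def l2(u, v):
        # plain Euclidean (L2) distance between two equal-length vectors
        return np.linalg.norm(np.asarray(u, float) - np.asarray(v, float))

    entity1 = np.array([[1, 3, 2, 1],
                        [3, 2, 4, 1]])   # entity 1 has two feature vectors
    entity2 = np.array([[1, 2, 4, 2]])   # entity 2 has one feature vector

    # Option 1: minimum L2 distance over all cross-pairs
    min_dist = min(l2(a, b) for a in entity1 for b in entity2)

    # Option 2: L2 distance between the average vectors of each entity
    mean_dist = l2(entity1.mean(axis=0), entity2.mean(axis=0))

    print(min_dist, mean_dist)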

This is not a bad question at all.
Sometimes mathematicians define the distance between two sets of elements, A and B, as the minimum distance over all pairs consisting of one element from A and one element from B.
You can also use a maximum-based variant called the Hausdorff distance: roughly, the largest distance from a point in one set to its nearest point in the other set.
Distance between two sets
In other words, you can compute the Euclidean distance between each element of set A and each element of set B, and then define the distance d(A,B) between the two sets from these pairwise distances: take the overall minimum for the first definition, or d_H(A,B) = max( max_{a in A} min_{b in B} d(a,b), max_{b in B} min_{a in A} d(a,b) ) for the Hausdorff distance (a small sketch follows the properties below).
The Hausdorff distance has some nicer mathematical properties, and on the space of non-empty compact sets (which your sets are, since they are finite) it is a proper mathematical distance, in that it satisfies:
For all non-empty compact sets A,B,C
d(A,B) >= 0 (with d(A,B) = 0 if and only if A=B)
d(A,B) = d(B,A)
d(A,B) <= d(A,C) + d(C,B)
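A small NumPy sketch of both set distances discussed above (the helper names are illustrative; SciPy's scipy.spatial.distance.directed_hausdorff is an alternative if you prefer a library routine):

    import numpy as np

    def pairwise_dists(A, B):
        # matrix of Euclidean distances between every element of A and every element of B
        A, B = np.asarray(A, float), np.asarray(B, float)
        return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

    def min_set_distance(A, B):
        # minimum over all element pairs
        return pairwise_dists(A, B).min()

    def hausdorff_distance(A, B):
        D = pairwise_dists(A, B)
        # largest distance from a point in one set to its nearest point in the other
        return max(D.min(axis=1).max(), D.min(axis=0).max())

    A = [[1, 3, 2, 1], [3, 2, 4, 1]]
    B = [[1, 2, 4, 2]]
    print(min_set_distance(A, B), hausdorff_distance(A, B))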

Related

Bhattacharyya distance between R,G,B and Y,Cb,Cr components of two images

I have 2 images taken from two different cameras and I have to associate an object in both images. I have separated the R, G, B and Y, Cb, Cr components and calculated the histogram of each component separately for both images.
Then I concatenated the histograms of all components into one vector.
I have already normalized each histogram separately so that sum(h) = 1;
but when I concatenate all the histograms, the sum of that vector is 6,
and when I apply the Bhattacharyya distance to both vectors, the result is in the range of 4 to 5.
I cannot understand these similarity results because, as far as I know, the result of the Bhattacharyya distance should be between 0 and 1.
Please help.
The quantity you are probably thinking of is the Jeffreys-Matusita distance, which is derived from the Bhattacharyya distance and is bounded: its best (largest) value is 2.
If you have 2 classes and the Jeffreys-Matusita distance is near 2, the classes are well separated (good for classification); if it is near 0, the classes are essentially the same.
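For reference, here is a minimal NumPy sketch of a per-histogram Bhattacharyya computation (the bin count and channel list are assumptions). Note that the Bhattacharyya coefficient lies in [0, 1] while the distance -ln(coefficient) is unbounded, and that each histogram should be normalized to sum to 1 before comparison rather than comparing a concatenation that sums to 6. OpenCV's compareHist with the HISTCMP_BHATTACHARYYA flag computes a closely related (Hellinger-style) measure per histogram pair.

    import numpy as np

    def bhattacharyya_coefficient(p, q):
        # both histograms are (re)normalized so that they each sum to 1
        p = np.asarray(p, float); q = np.asarray(q, float)
        p, q = p / p.sum(), q / q.sum()
        return np.sum(np.sqrt(p * q))                      # in [0, 1]

    def bhattacharyya_distance(p, q):
        return -np.log(bhattacharyya_coefficient(p, q))    # in [0, inf)

    # hypothetical per-channel histograms (R, G, B, Y, Cb, Cr) of the two images
    hists_img1 = [np.random.rand(32) for _ in range(6)]
    hists_img2 = [np.random.rand(32) for _ in range(6)]

    # compare channel by channel and average, rather than concatenating
    dists = [bhattacharyya_distance(h1, h2)
             for h1, h2 in zip(hists_img1, hists_img2)]
    print(np.mean(dists))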

K means clustering for multidimensional data

If the data set has 440 objects and 8 attributes (it is the Wholesale customers data set, taken from the UCI machine learning repository), how do we calculate centroids for such a data set?
https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
If I calculate the mean of the values of each row, will that be the centroid?
And how do I plot the resulting clusters in MATLAB?
OK, first of all, in the dataset, 1 row corresponds to a single example in the data, you have 440 rows, which means the dataset consists of 440 examples. Each column contains the values for that specific feature (or attribute as you call it), e.g. column 1 in your dataset contains the values for the feature Channel, column 2 the values for the feature Region and so on.
K-Means
Now for K-Means Clustering, you need to specify the number of clusters (the K in K-Means). Say you want K=3 clusters, then the simplest way to initialise K-Means is to randomly choose 3 examples from your dataset (that is 3 rows, randomly drawn from the 440 rows you have) as your centroids. Now these 3 examples are your centroids.
You can think of your centroids as 3 bins, and you want to put every example from the dataset into the closest bin (closeness is usually measured by the Euclidean distance; check the function norm in MATLAB).
After the first round of putting all examples into the closest bin, you recalculate the centroids by calculating the mean of all examples in their respective bins. You repeat the process of putting all the examples into the closest bin until no example in your dataset moves to another bin.
Some Matlab starting points
You load the data by X = load('path/to/the/dataset', '-ascii');
In your case X will be a 440x8 matrix.
You can calculate the Euclidean distance from an example to a centroid by
distance = norm(example - centroid1);
where both example and centroid1 have dimensionality 1x8.
Recalculating the centroids would work as follows, suppose you have done 1 iteration of K-Means and have put all examples into their respective closest bin. Say Bin1 now contains all examples that are closest to centroid1 and therefore Bin1 has dimensionality 127x8, which means that 127 examples out of 440 are in this bin. To calculate the centroid position for the next iteration you can then do centroid1 = mean(Bin1);. You would do similar things to your other bins.
As for plotting, you have to note that your dataset contains 8 features, which means 8 dimensions and which is not visualisable. I'd suggest you create or look for a (dummy) dataset which only consists of 2 features and would therefore be visualisable by using Matlab's plot() function.
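If it helps to see the whole loop in one place, here is a minimal sketch of the procedure described above written in Python/NumPy rather than MATLAB (purely illustrative: K, the random initialisation and the file handling are assumptions, and empty bins are not handled):

    import numpy as np

    X = np.loadtxt('path/to/the/dataset')   # 440x8 matrix; adjust delimiter/header handling to your file
    K = 3
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), K, replace=False)]   # K randomly chosen rows as initial centroids

    while True:
        # assign every example to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of the examples in its bin
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids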

Optimal pairs of farthest points

I have an even number of points in 2D. I need an algorithm that can pair up those points such that the total sum of the distances between the pairs is maximum.
Dynamic programming and a greedy approach won't work, I think.
Can I use linear programming or the Hungarian algorithm? Or anything else?
You certainly can use integer linear programming. Here is an example formulation (a sketch follows below):
Introduce a binary variable x[ij] for each unordered pair of distinct points i and j (i.e. such that i<j), where x[ij]=1 iff the points i and j are paired together.
Compute all the distances d[ij] (for i<j).
The objective is to maximize sum_{i<j} d[ij]*x[ij], subject to the constraint that each point is in exactly one pair, i.e. for every point k: sum_{i<k} x[ik] + sum_{j>k} x[kj] = 1.
Note that this also works for 3D points: all you need is the distance between every pair of points.
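A sketch of this formulation using the PuLP modelling library (PuLP and the sample points are my own assumptions; any MILP solver interface would express the same model):

    import itertools, math
    import pulp

    # hypothetical sample points; the number of points must be even
    points = [(0, 0), (4, 0), (0, 3), (5, 5)]
    n = len(points)
    d = {(i, j): math.dist(points[i], points[j])
         for i, j in itertools.combinations(range(n), 2)}

    prob = pulp.LpProblem("max_total_pair_distance", pulp.LpMaximize)
    x = {ij: pulp.LpVariable(f"x_{ij[0]}_{ij[1]}", cat="Binary") for ij in d}

    # objective: maximize the total distance of the chosen pairs
    prob += pulp.lpSum(d[ij] * x[ij] for ij in d)

    # each point k belongs to exactly one chosen pair
    for k in range(n):
        prob += pulp.lpSum(x[ij] for ij in d if k in ij) == 1

    prob.solve()
    pairs = [ij for ij, var in x.items() if var.value() and var.value() > 0.5]
    print(pairs)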

What does the distance attribute in DMatches mean?

I have a short question: when I do feature matching in OpenCV, what does the distance attribute of the DMatch objects in a MatOfDMatch mean?
I know that I have to filter out matches with a larger distance because they aren't as good as those with a lower distance. But what exactly does this attribute measure? Is it some kind of deviation?
In this context, a feature is a point of interest in the image. In order to compare features, you "describe" them using a descriptor extractor. Each feature is then associated with a descriptor. When you match features, you actually match their descriptors.
A descriptor is a multidimensional vector. It can be real-valued (e.g. SIFT) or binary (e.g. BRIEF).
A matching is a pair of descriptors, one from each image, which are the most similar among all of the descriptors. And of course, to find the descriptor in image B that is the most similar to a descriptor in image A, you need a measure of this similarity.
There are multiple ways to compute a "score of similarity" between two vectors. For real-valued descriptors the Euclidean distance is often used, while the Hamming distance is common for binary descriptors.
As a conclusion, we can now understand the distance attribute: it is the score of similarity between the two descriptors of a match.
The distance attribute in DMatch is a measure of the similarity between the two descriptors (feature vectors). The smaller the distance, the more similar the two descriptors, and vice versa.
A lesson learnt from my experience when I started out:
Do not confuse DMatch.distance with the normal spatial distance between two points. They are different. The distance in DMatch represents the distance between two descriptors (vectors with 128 values in the case of SIFT).
In case of SIFT (local feature descriptor):
1) First, you detect key points (interesting points) in the two images that you want to compare.
2) Then you compute SIFT descriptors for a defined area (a 16 x 16 neighbourhood) around each key point. Each descriptor stores the histogram of oriented gradients for that area.
3) Finally, the descriptors of both the images are matched to find matching key points between the images. This is done by using BFMatcher -> match(), knnMatch() or FlannBasedMatcher -> knnMatch().
4) If you are using BFMatcher.match(), you will get a list of DMatch objects. The number of DMatch objects is equal to the number of matches. Each DMatch object contains the following four attributes for each matched key point pair.
DMatch.distance - Distance between descriptors. The lower, the better it is.
DMatch.trainIdx - Index of the descriptor in the train descriptors (the second set passed to match())
DMatch.queryIdx - Index of the descriptor in the query descriptors (the first set passed to match())
DMatch.imgIdx - Index of the train image.
5) DMatch.distance can be computed with one of several norms -> NORM_L1, NORM_L2 (Euclidean distance), NORM_HAMMING, NORM_HAMMING2, ... which can be passed as a parameter to BFMatcher. The default is NORM_L2 (Euclidean distance).
6) Difference between Spatial Euclidean distance and DMatch Euclidean distance:
SIFT descriptor 1 -> [a1,a2,....a128]
SIFT descriptor 2 -> [b1,b2,....b128]
(DMatch) -> Euclidean distance = sqrt[(a1-b1)^2 + (a2-b2)^2 +...+(a128-b128)^2]
Point 1 -> (x1, y1)
Point 2 -> (x2, y2)
(Spatial) -> Euclidean distance = sqrt[(x2-x1)^2 + (y2-y1)^2]
Thus, the distance in DMatch is the distance between descriptors; it represents the degree of similarity between two descriptors and is different from the normal spatial Euclidean distance between two points.
If the distance between the descriptors is less, then their similarity is high. If the distance between the descriptors is more, then their similarity is low.
Hope this helps in understanding the meaning of distance attribute in DMatch objects. If you are clear on this, then you can work with any feature descriptors like HOG, SIFT, SURF, ORB, BRISK, FREAK,... All of them are similar when it comes to matching their respective feature descriptors.
Usually when you are matching two features, you are actually comparing two vectors under a certain distance metric. Now let's assume your feature is SIFT with 128 dimensions, and you compare two SIFT features a and b using the Euclidean distance; then DMatch.distance is equal to sqrt[(a1-b1)^2 + (a2-b2)^2 + ... + (a128-b128)^2], exactly as in the formula above.
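A minimal OpenCV (Python) sketch that prints the distance attribute of the resulting DMatch objects (the file names are placeholders; ORB is used here because it is available in all OpenCV builds, so the distances are Hamming distances rather than Euclidean ones):

    import cv2

    img1 = cv2.imread('query.png', cv2.IMREAD_GRAYSCALE)   # placeholder file names
    img2 = cv2.imread('train.png', cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # binary descriptors -> Hamming norm; crossCheck keeps only mutual best matches
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)

    for m in matches[:10]:
        # m.distance is the descriptor distance, NOT the pixel distance between key points
        print(m.queryIdx, m.trainIdx, m.distance)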

K Nearest Neighbor classifier

I have a set of about 200 points (x, y) from an image. The 200 data points belong to 11 classes (which I think will become the class labels). My problem is: how do I represent the x, y values as one data point?
My first thought was to represent them separately with the labels and then, when I get a point to classify, classify x and y separately. Something tells me that this is incorrect.
Please advise me on how to represent the x, y values as one data element.
I can't see what problem you are facing. In the kNN algorithm we can use variables with multiple dimensions; you just need to use a list from the Python standard library or an array from the NumPy library to organize the data, such as: group = numpy.array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
or group = [[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]] to represent (1.0,1.1), (1.0,1.0), (0,0), (0,0.1).
However, I suggest using NumPy, as it has many built-in functions and they are implemented in C, which keeps programs efficient.
If you use NumPy, you'd better do all the operations in a vectorized (matrix) way; for example, you can use point = numpy.tile([0, 0], (4, 1)) and distance(group - point) (where distance is a function written by me) to calculate the distances without iteration.
The key is not the representation but the distance calculation. The points in your case are essentially single elements with two dimensions (x, y). The kNN algorithm handles the n-dimensional case by itself: it finds the k nearest neighbours. So you can use the Euclidean distance d((x1, y1), (x2, y2)) = ((x1-x2)^2 + (y1-y2)^2)^0.5, where (x1, y1) is the first point, as the distance between points in your case.
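For completeness, a small NumPy sketch of a kNN classifier over such (x, y) points (the sample coordinates, labels and k are made up for illustration):

    import numpy as np

    def knn_predict(train_X, train_y, query, k=3):
        # Euclidean distance from the query point to every training point
        dists = np.linalg.norm(train_X - query, axis=1)
        nearest = train_y[np.argsort(dists)[:k]]
        # majority vote among the k nearest labels
        values, counts = np.unique(nearest, return_counts=True)
        return values[counts.argmax()]

    # hypothetical data: each row is one (x, y) point, each label is one of the classes
    train_X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
    train_y = np.array([0, 0, 1, 1])
    print(knn_predict(train_X, train_y, np.array([0.1, 0.2]), k=3))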
