Euclidean Distance or cosine similarity? [closed] - search-engine

I was reading Similarity Measure and suddenly my whole world was falling apart. I have implemented a search engine using a clustering technique. For clustering I used K-means, which uses Euclidean distance as its distance measure. I also used cosine similarity to display the results. I was getting amazingly accurate results. But now that I have read this, I realize that what I did was normalize the document vectors and calculate the Euclidean distance between the vectors, so I have not considered magnitude anywhere.
Am I doing something wrong?
Although I think that a higher term frequency would lead to a higher tf-idf value and a higher normalized tf-idf value, and hence the document would still be appropriately ranked high.
Thanks
Results (using non-normalized vectors; the figures are Euclidean distances)
61.79689257425985 222Proposed Research Details.doc
144.15451315901478 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc
72.61392308146608 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc
72.96125277156261 done_Management strategies for impriing rabi (SKN Math).doc
65.51734241367222 done_RPFIII_dr.dogra.doc
66.72042766100921 Evaluation of crops and their varieties (SKN Math).doc
418.8868087170988 P. VIJAYA KUMAR (DSS).doc
140.3914521621597 RPF - I PIMS-ICAR project proposal for IASRI.doc
72.95414421468679 RPF-III__Indo-US_project.doc
82.25126123574397 220Introduction and objectives.doc
Results (with normalized vectors; the figures are Euclidean distances)
1.3435369899385359 222Proposed Research Details.doc
1.1277471087250086 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc
1.2741267093494966 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc
1.264154265747389 done_Management strategies for impriing rabi (SKN Math).doc
1.2902191708899362 done_RPFIII_dr.dogra.doc
1.3128744973475515 Evaluation of crops and their varieties (SKN Math).doc
0.4924243033927417 P. VIJAYA KUMAR (DSS).doc
1.1747048933792805 RPF - I PIMS-ICAR project proposal for IASRI.doc
1.29150899172647 RPF-III__Indo-US_project.doc
1.318016051789028 220Introduction and objectives.doc
Results (the figures are cosine similarities)
0.09745417833344654 222Proposed Research Details.doc
0.36409322938119104 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc
0.1883005642611103 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc
0.2009569961963377 done_Management strategies for impriing rabi (SKN Math).doc
0.16766724553404047 done_RPFIII_dr.dogra.doc
0.13818027710720598 Evaluation of crops and their varieties (SKN Math).doc
0.8787591527140649 P. VIJAYA KUMAR (DSS).doc
0.3100342067353838 RPF - I PIMS-ICAR project proposal for IASRI.doc
0.16600226214483405 RPF-III__Indo-US_project.doc
0.13141684361322944 220Introduction and objectives.doc
Results 1 and 2 do not agree with each other, while 2 and 3 strongly do: more similarity, smaller distance. The distances are taken between the cluster centroid vector and the document vector of each document.
In fact the weirdest result is the document with a Euclidean distance of 418 that has the highest similarity of 0.87, while its normalized distance becomes 0.49 and agrees with the similarity.

If I remember right from my information retrieval lecture, after normalizing both vectors, Euclidean distance and cosine similarity produce the same ranking, just in reverse order: the smaller the distance, the larger the similarity.
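As a sanity check of this claim, here is a small NumPy sketch (the vectors are made up for illustration) showing that for L2-normalized vectors the squared Euclidean distance equals 2 * (1 - cosine similarity), so the two rankings are mirror images of each other:

import numpy as np

a = np.random.rand(50)                      # stand-ins for two tf-idf vectors
b = np.random.rand(50)

a_n = a / np.linalg.norm(a)                 # L2-normalize both vectors
b_n = b / np.linalg.norm(b)

cos_sim = np.dot(a_n, b_n)
dist = np.linalg.norm(a_n - b_n)

# For unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b)),
# so sorting by ascending distance equals sorting by descending similarity.
print(dist ** 2, 2 * (1 - cos_sim))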

Related

How to squish a continuous cosine-theta score to a discrete (0/1) output?

I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar, their words should have considerable overlap. However, a cosine-theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I would have to find the optimal parameters to give to such functions, and I do not know whether these functions are satisfactory solutions. I was thinking that, since I have the cosine-theta score, I can also calculate the percentage of overlap between two articles (e.g. the number of overlapping words divided by the number of words in the article) and maybe some more interesting features. Then, with the data, I could maybe write a function (what type of function I do not know, and that is part of the question!), after which I can minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs with labels (0/1) in order to train a network. Is this worth the effort?
# (requires `import math` at module level)
# Count the words of the two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)

# Words that occur in both strings.
v3 = set(v1.keys()) & set(v2.keys())

# Ratio of the overlap to the shorter article's length (1 overlapping word
# out of 2 words matters more than 4 overlapping words out of 492).
p = len(v3) / float(min(len(v1), len(v2))) if v1 and v2 else 0.0

# Cosine similarity between the two word-count vectors.
numerator = sum(v1[w] * v2[w] for w in v3)
w1 = sum(v1[w] ** 2 for w in v1)
w2 = sum(v2[w] ** 2 for w in v2)
denominator = math.sqrt(w1) * math.sqrt(w2)
if not denominator:
    return 0.0
else:
    return float(numerator) / denominator
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. Unless you have a labelled data set, it is up to you to choose how the overlap affects whether or not two strings are "matching". If you do have a labelled data set (i.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or an SVM due to the potentially high-dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, suppose you have a model which, out of 100 samples, only makes 1 prediction (giving 99 unknowns). Technically, if that one prediction is correct, the model has 100% precision, but very low recall. Generally in machine learning you will find a trade-off between recall and precision.
Some people like to use metrics which combine the two (the most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about precision: I would only want to target consumers who are likely to buy my product. If, however, we are testing for a deadly disease or for markers of bank fraud, then it is acceptable for that test to be precise only 10% of the time, as long as its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is to define some cut-off value which you believe indicates a good match. This would then be more analogous to a binary clustering problem, and you could use some more abstract measure such as distance to a centroid to test which cluster (either the "related" or the "unrelated" cluster) the point belongs to. Note, however, that here your features feel like they would be incredibly hard to define.
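For what it's worth, a minimal sketch of the labelled-data route with scikit-learn, assuming you have already computed the cosine score and the overlap ratio p for some hand-labelled pairs (all names and numbers below are hypothetical):

import numpy as np
from sklearn.svm import SVC

# Each row is one article pair: [cosine_theta_score, overlap_ratio_p].
X = np.array([[0.92, 0.60], [0.81, 0.45], [0.54, 0.20],
              [0.31, 0.10], [0.12, 0.05], [0.88, 0.55]])
y = np.array([1, 1, 0, 0, 0, 1])            # 1 = related, 0 = not related

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

# The classifier returns the discrete 0/1 label directly.
print(clf.predict([[0.70, 0.30]]))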

Convolution Vs Correlation

Can anyone explain the similarities and differences between correlation and convolution? Please explain the intuition behind them, not the mathematical equations (i.e., flipping the kernel/impulse). Application examples in the image-processing domain for each would be appreciated too.
You will likely get a much better answer on the DSP Stack Exchange, but for starters: there are a number of similar terms, and their definitions can be tricky to pin down.
Correlation
Cross correlation
Convolution
Correlation coefficient
Sliding dot product
Pearson correlation
1, 2, 3, and 5 are very similar.
4 and 6 are similar.
Note that all of these terms have dot products rearing their heads.
You asked about Correlation and Convolution - these are conceptually the same except that the output is flipped in convolution. I suspect that you may have been asking about the difference between correlation coefficient (such as Pearson) and convolution/correlation.
Prerequisites
I am assuming that you know how to compute the dot product. Given two equal-sized vectors v and w, each with three elements, the algebraic dot product is v[0]*w[0]+v[1]*w[1]+v[2]*w[2].
There is a lot of theory behind the dot product in terms of what it represents, etc.
Notice that the dot product is a single number (a scalar) representing the mapping between the two vectors/points v and w. In geometry one frequently computes the cosine of the angle between two vectors, which uses the dot product. The cosine of the angle between two vectors lies between -1 and 1 and can be thought of as a measure of similarity.
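A tiny sketch of that (the vectors are just illustrative), computing the dot product and the cosine of the angle it induces:

import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([2.0, 0.0, 1.0])

dot = np.dot(v, w)                                   # v[0]*w[0] + v[1]*w[1] + v[2]*w[2]
cos_angle = dot / (np.linalg.norm(v) * np.linalg.norm(w))
print(dot, cos_angle)                                # cos_angle lies in [-1, 1]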
Correlation coefficient (Pearson)
The correlation coefficient between equal-length vectors v and w is simply the dot product of the two zero-mean signals (subtract the mean of v from v to get zmv, and the mean of w from w to get zmw; here zm is shorthand for zero mean), divided by the magnitudes of zmv and zmw.
This produces a number between -1 and 1: close to zero means little correlation, close to +/-1 means high correlation. It measures the similarity between the two vectors.
See http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient for a better definition.
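As a quick check of that definition (example data made up), the zero-mean dot-product formulation matches SciPy's Pearson implementation:

import numpy as np
from scipy.stats import pearsonr

v = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

zmv = v - v.mean()                                   # zero-mean v
zmw = w - w.mean()                                   # zero-mean w
r = np.dot(zmv, zmw) / (np.linalg.norm(zmv) * np.linalg.norm(zmw))

print(r, pearsonr(v, w)[0])                          # the two values agree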
Convolution and Correlation
When we want to correlate/convolve v1 and v2, we are basically computing a series of dot products and putting them into an output vector. Let's say that v1 has three elements and v2 has 10 elements; sliding v1 across v2 wherever the two fully overlap gives 10 - 3 + 1 = 8 dot products:
output[0] = v1[0]*v2[0]+v1[1]*v2[1]+v1[2]*v2[2]
output[1] = v1[0]*v2[1]+v1[1]*v2[2]+v1[2]*v2[3]
output[2] = v1[0]*v2[2]+v1[1]*v2[3]+v1[2]*v2[4]
output[3] = v1[0]*v2[3]+v1[1]*v2[4]+v1[2]*v2[5]
output[4] = v1[0]*v2[4]+v1[1]*v2[5]+v1[2]*v2[6]
output[5] = v1[0]*v2[5]+v1[1]*v2[6]+v1[2]*v2[7]
output[6] = v1[0]*v2[6]+v1[1]*v2[7]+v1[2]*v2[8]
output[7] = v1[0]*v2[7]+v1[1]*v2[8]+v1[2]*v2[9]
For a true convolution, we flip the kernel v1 before sliding it:
output[0] = v1[2]*v2[0]+v1[1]*v2[1]+v1[0]*v2[2]
output[1] = v1[2]*v2[1]+v1[1]*v2[2]+v1[0]*v2[3]
output[2] = v1[2]*v2[2]+v1[1]*v2[3]+v1[0]*v2[4]
output[3] = v1[2]*v2[3]+v1[1]*v2[4]+v1[0]*v2[5]
output[4] = v1[2]*v2[4]+v1[1]*v2[5]+v1[0]*v2[6]
output[5] = v1[2]*v2[5]+v1[1]*v2[6]+v1[0]*v2[7]
output[6] = v1[2]*v2[6]+v1[1]*v2[7]+v1[0]*v2[8]
output[7] = v1[2]*v2[7]+v1[1]*v2[8]+v1[0]*v2[9]
Notice that we have fewer than 10 elements in the output (8 here) because, for simplicity, I am computing the correlation/convolution only where both v1 and v2 are defined.
Notice also that the convolution is simply a number of dot products. There has been considerable work over the years to speed up convolutions. The sweeping dot products are slow and can be sped up by first transforming both vectors into the Fourier domain, performing a single element-wise multiplication there, and then applying the inverse transform, though I won't go into that here...
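A short NumPy sketch of the sliding dot products above (my own example values), checked against NumPy's built-in routines in 'valid' mode:

import numpy as np

v1 = np.array([1.0, 2.0, 3.0])                       # short kernel
v2 = np.arange(10, dtype=float)                      # longer signal

# Correlation: a dot product of v1 with each window of v2.
corr = np.array([np.dot(v1, v2[i:i + 3]) for i in range(len(v2) - 2)])

# Convolution: the same sweep with the kernel flipped.
conv = np.array([np.dot(v1[::-1], v2[i:i + 3]) for i in range(len(v2) - 2)])

print(np.allclose(corr, np.correlate(v2, v1, mode='valid')))   # True
print(np.allclose(conv, np.convolve(v2, v1, mode='valid')))    # True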
You might want to look at these resources as well as searching: Calculating Pearson correlation and significance in Python.
The best answer I got was from this document: http://www.cs.umd.edu/~djacobs/CMSC426/Convolution.pdf
I'm just going to copy the excerpt from the doc:
"The key difference between the two is that convolution is associative. That is, if F and G are filters, then F*(GI) = (FG)*I. If you don’t believe this, try a simple example, using F=G=(-1 0 1), for example. It is very convenient to have convolution be associative. Suppose, for example, we want to smooth an image and then take its derivative. We could do this by convolving the image with a Gaussian filter, and then convolving it with a derivative filter. But we could alternatively convolve the derivative filter with the Gaussian to produce a filter called a Difference of Gaussian (DOG), and then convolve this with our image. The nice thing about this is that the DOG filter can be precomputed, and we only have to convolve one filter with our image.
In general, people use convolution for image processing operations such as smoothing, and they use correlation to match a template to an image. Then, we don’t mind that correlation isn’t associative, because it doesn’t really make sense to combine two templates into one with correlation, whereas we might often want to combine two filter together for convolution."
Convolution is just like correlation, except that we flip over the filter before correlating.

How to normalize tf-idf vectors for SVMs?

I am using Support Vector Machines for document classification. My feature set for each document is a tf-idf vector. I have M documents, each with a tf-idf vector of size N, giving an M x N matrix.
M is just 10 documents, and each tf-idf vector has 1000 words, so I have far more features than documents. Also, each word occurs in only 2 or 3 documents. When I normalize each feature (word), i.e. column normalization into [0,1], with
val_feature_j_row_i = ( val_feature_j_row_i - min_feature_j ) / ( max_feature_j - min_feature_j )
it gives me either 0 or 1, of course.
And it gives me bad results. I am using libsvm with an RBF kernel, C = 0.0312, gamma = 0.007815.
Any recommendations?
Should I include more documents? Or other functions, such as sigmoid, or better normalization methods?
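For reference, a minimal sketch of the column-wise min-max normalization described above (random data stands in for the real tf-idf matrix):

import numpy as np

X = np.random.rand(10, 1000)                 # placeholder M x N tf-idf matrix

col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min + 1e-12)   # epsilon avoids 0/0 for constant columns

print(X_scaled.min(), X_scaled.max())        # values lie in [0, 1]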
The list of things to consider and correct is quite long, so first of all I would recommend some machine-learning reading before trying to tackle the problem itself. There are dozens of great books (e.g. Haykin's "Neural Networks and Learning Machines") as well as online courses, which will help you with such basics, like those listed here: http://www.class-central.com/search?q=machine+learning .
Getting back to the problem itself:
10 documents is orders of magnitude too small to get any significant results and/or insight into the problem,
there is no universal method of data preprocessing; you have to analyze it through numerous tests and data analytics,
SVMs are parametric models: you cannot use a single pair of C and gamma values and expect any reasonable results. You have to check dozens of them to even get a clue "where to search". The simplest method for doing so is a so-called grid search (see the sketch below),
1000 features is a large number of dimensions; this suggests that using a kernel which implies an infinite-dimensional feature space is quite... redundant. It would be a better idea to first analyze simpler ones, which have a smaller chance of overfitting (linear or a low-degree polynomial),
finally, is tf-idf a good choice if "each word occurs in 2 or 3 documents"? It may be doubtful, unless what you actually mean is 20-30% of documents,
finally, regarding the simple feature squashing:
It gives me either 0 or 1, of course.
it should result in values across the [0,1] interval, not just its endpoints, so if that is what you see, you probably have an error in your implementation.
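A minimal grid-search sketch of the kind suggested above, assuming X is the M x N tf-idf matrix and y the class labels (the data and parameter ranges below are placeholders, not recommended final values):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.random.rand(10, 1000)                 # placeholder for the real tf-idf matrix
y = np.array([0, 1] * 5)                     # placeholder labels

param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [2.0 ** k for k in range(-5, 6)],
    "gamma": [2.0 ** k for k in range(-10, 1)],
}
search = GridSearchCV(SVC(), param_grid, cv=2)   # tiny cv fold count because of the tiny sample
search.fit(X, y)
print(search.best_params_, search.best_score_)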

How do I cluster with KL-divergence?

I want to cluster my data with KL-divergence as my metric.
In K-means:
Choose the number of clusters.
Initialize each cluster's mean at random.
Assign each data point to a cluster c with minimal distance value.
Update each cluster's mean to that of the data points assigned to it.
In the Euclidean case it's easy to update the mean, just by averaging each vector.
However, if I'd like to use KL-divergence as my metric, how do I update my mean?
Clustering with KL-divergence may not be the best idea, because KLD is missing an important property of metrics: symmetry. The resulting clusters could then be quite hard to interpret. If you want to go ahead with KLD, you could use the average of the two KLDs as your distance, i.e.
d(x,y) = KLD(x,y)/2 + KLD(y,x)/2
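A small sketch of that symmetrized distance using SciPy (scipy.stats.entropy computes KL(p||q) when given two distributions; the example distributions are made up):

import numpy as np
from scipy.stats import entropy

def sym_kl(p, q):
    # Average of the two one-way divergences, so d(p, q) == d(q, p).
    return 0.5 * entropy(p, q) + 0.5 * entropy(q, p)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(sym_kl(p, q), sym_kl(q, p))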
It is not a good idea to use KLD, for two reasons:
It is not symmetric: KLD(x,y) != KLD(y,x).
You need to be careful when using KLD in programming: the division may produce Inf values and NaN as a result.
Adding a small number may affect the accuracy.
Well, it might not be a good idea to use KL in the "k-means framework". As was said, it is not symmetric, and K-means is intended to work in Euclidean space.
However, you can try using NMF (non-negative matrix factorization). In fact, in the book Data Clustering (edited by Aggarwal and Reddy) you can find a proof that NMF (in a clustering task) works like k-means, only with a non-negativity constraint. The fun part is that NMF may use a bunch of different distances and divergences. If you program in Python: scikit-learn 0.19 implements the beta divergence, which has a variable beta as a degree of freedom. Depending on the value of beta, the divergence behaves differently; when beta equals 1 it reduces to the (generalized) KL divergence.
This is actually widely used in the topic-modelling context, where people try to cluster documents/words over topics (or themes). By using KL, the results can be interpreted as a probabilistic function of how the word-topic and topic distributions are related.
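A minimal scikit-learn sketch of that option (random data in place of a real document-term matrix); note that the KL beta-loss requires the multiplicative-update solver:

import numpy as np
from sklearn.decomposition import NMF

X = np.random.rand(20, 50)                   # placeholder document-term matrix

model = NMF(n_components=5, beta_loss='kullback-leibler', solver='mu',
            max_iter=500, random_state=0)
W = model.fit_transform(X)                   # document-topic weights
H = model.components_                        # topic-term weights
print(W.shape, H.shape)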
You can find more information:
FÉVOTTE, C., IDIER, J. "Algorithms for Nonnegative Matrix Factorization with the β-Divergence", Neural Computation, v. 23, n. 9, pp. 2421–2456, 2011. ISSN: 0899-7667. doi: 10.1162/NECO_a_00168.
LUO, M., NIE, F., CHANG, X., et al. "Probabilistic Non-Negative Matrix Factorization and Its Robust Extensions for Topic Modeling." In: AAAI, pp. 2308–2314, 2017.
KUANG, D., CHOO, J., PARK, H. "Nonnegative matrix factorization for interactive topic modeling and document clustering". In: Partitional Clustering Algorithms, Springer, pp. 215–243, 2015.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
K-means is intended to work with Euclidean distance: if you want to use non-Euclidean similarities in clustering, you should use a different method. The most principled way to cluster with an arbitrary similarity metric is spectral clustering, and K-means can be derived as a variant of this where the similarities are the Euclidean distances.
And as #mitchus says, KL divergence is not a metric. You may want the Jensen-Shannon divergence, or its square root, the Jensen-Shannon distance, which is symmetric.
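For completeness, SciPy (1.2+) ships the Jensen-Shannon distance directly; a tiny sketch with made-up distributions:

import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])

# jensenshannon returns the JS *distance* (square root of the JS divergence).
print(jensenshannon(p, q), jensenshannon(q, p))   # symmetric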

The best way to calculate the best threshold with P. Viola, M. Jones Framework

I'm trying to implement the P. Viola and M. Jones detection framework in C++ (at the beginning simply a sequential classifier, not the cascaded version). I think I have designed all the required classes and modules (e.g. integral images, Haar features), except one, the most important: the AdaBoost core algorithm.
I have read the original P. Viola and M. Jones paper and many other publications. Unfortunately I still don't understand how I should find the best threshold for a single weak classifier. I have found only brief references to "weighted median" and "Gaussian distribution" algorithms and many pieces of mathematical formulas...
I have tried to use the OpenCV Train Cascade module sources as a template, but they are so comprehensive that reverse engineering the code is very time-consuming. I also wrote my own simple code to understand the idea of adaptive boosting.
The question is: could you explain to me the best way to calculate the best threshold for a single weak classifier?
Below I'm presenting AdaBoost pseudocode, rewritten from a sample found via Google, but I'm not convinced it is the correct approach. Calculating one weak classifier is very slow (a few hours), and I have doubts especially about the method of calculating the best threshold.
(1) AdaBoost::FindNewWeakClassifier
(2) AdaBoost::CalculateFeatures
(3) AdaBoost::FindBestThreshold
(4) AdaBoost::FindFeatureError
(5) AdaBoost::NormalizeWeights
(6) AdaBoost::FindLowestError
(7) AdaBoost::ClassifyExamples
(8) AdaBoost::UpdateWeights
DESCRIPTION (1)
-Generates all possible arrangements of features in the detection window and puts them into a vector
DO IN LOOP
-Runs main calculating function (2)
END
DESCRIPTION(2)
-Normalizes weights (5)
DO FOR EACH HAAR FEATURE
-Applies the next feature from the list to all integral images, one feature at a time
-Finds the best threshold for the current feature (3)
-Finds the error for the current feature's best threshold (4)
-Saves the error for the current feature in an array
-Saves the threshold for the current feature in an array
-Saves the threshold sign for the current feature in an array
END LOOP
-Finds the index of the classifier with the lowest error selected by the above loop (6)
-Gets the error value of the best feature
-Calculates the value of the best feature on all integral images (7)
-Updates weights (8)
-Adds the new weak classifier to the vector
DESCRIPTION (3)
-Calculates an error for each feature threshold on the positive integral images - separately for the "+" and "-" signs (4)
-Returns the threshold and sign of the feature with the lowest error
DESCRIPTION(4)
- Returns the feature error over all samples, by evaluating the inequality f(x) * sign < sign * threshold
DESCRIPTION (5)
-Ensures that the sample weights form a probability distribution
DESCRIPTION (6)
-Finds the classifier with the lowest error
DESCRIPTION (7)
-Calculates the value of the best feature on all integral images
-Counts the number of false positives and false negatives
DESCRIPTION (8)
-Corrects weights, depending on classification results
Thank you for any help
In the original Viola-Jones paper, section 3.1 Learning Discussion (paragraph 4, to be precise), you will find the procedure for finding the optimal threshold.
I'll sum up the method quickly below.
The optimal threshold for each feature is sample-weight dependent and is therefore recalculated in every iteration of AdaBoost. The best weak classifier's threshold is saved, as mentioned in the pseudocode.
In every round, for each weak classifier, you have to sort the N training samples according to their feature values. Placing a threshold separates this sequence into two parts. Each part will have a majority of either positive or negative samples, along with a few samples of the other type.
T+ : total sum of positive sample weights
T- : total sum of negative sample weights
S+ : sum of positive sample weights below the threshold
S- : sum of negative sample weights below the threshold
The error for this particular threshold is:
e = MIN((S+) + (T-) - (S-), (S-) + (T+) - (S+))
Why the minimum? Here's an example:
If the samples and the threshold are like this -
+ + + + + - - | + + - - - - -
then in the first round, if all weights are equal (= w), taking the minimum gives you an error of 4*w instead of 10*w.
You calculate this error for all N possible ways of separating the samples.
The minimum error gives you the range of threshold values. The actual threshold is probably the average of the adjacent feature values (I'm not sure though; do some research on this).
This was the second step in your DO FOR EACH HAAR FEATURE loop.
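For concreteness, here is a minimal sketch (my own code, not the Viola-Jones or OpenCV implementation) of that sweep: sort the samples by feature value, keep running sums of the positive and negative weights, and evaluate the error formula above at every split point:

import numpy as np

def best_threshold(feature_values, labels, weights):
    # labels: 1 for positive samples, 0 for negative; weights sum to 1.
    order = np.argsort(feature_values)
    f, y, w = feature_values[order], labels[order], weights[order]

    t_pos = w[y == 1].sum()          # T+ : total positive weight
    t_neg = w[y == 0].sum()          # T- : total negative weight
    s_pos = s_neg = 0.0              # S+, S- : weight below the current split

    best_err, best_thr, best_sign = np.inf, f[0], 1
    for i in range(len(f)):
        # Split just before sample i: call everything below it negative or positive.
        err_below_neg = s_pos + (t_neg - s_neg)
        err_below_pos = s_neg + (t_pos - s_pos)
        err = min(err_below_neg, err_below_pos)
        if err < best_err:
            best_err, best_thr = err, f[i]
            best_sign = 1 if err_below_pos < err_below_neg else -1
        if y[i] == 1:
            s_pos += w[i]
        else:
            s_neg += w[i]
    return best_thr, best_sign, best_err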
The cascades shipped with OpenCV were created by Rainer Lienhart, and I don't know what method he used.
You could closely follow the OpenCV source code to get any further improvements on this procedure.
