I have been reading about pattern recognition, and I want to put together a survey of methods for evaluating the similarity of vectors. As far as I know, there are Euclidean distance, Mahalanobis distance, and cosine distance. Can anyone suggest more names or keywords to search for?
Also mutual neighbor distance (MND), Minkowski metric, Hausdorff distance, conceptual similarity, normalized Google distance, KL divergence, Spearman’s rank correlation, and Lin similarity. (Not all of these are vector based.)
I highly recommend Pattern Classification by Duda, Hart, and Stork for further reading. It is extensively cited.
Pearson, Manhattan, Gower, Jaccard, Tanimoto, Russell-Rao, Dice, Kulczynski, Simple Matching, Levenshtein
You can define your own distance metrics too, so I would say there can be A LOT of possible distance metrics. Whether those metrics are good or have any meaning is another story.
Hamming distance
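Many of the measures named above are available out of the box in scipy.spatial.distance, so you can experiment before committing to one. A minimal sketch (the exact set of functions depends on your SciPy version):

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 0.0, 2.0, 3.0])
v = np.array([0.0, 1.0, 2.0, 1.0])

print(distance.euclidean(u, v))       # straight-line (L2) distance
print(distance.cityblock(u, v))       # Manhattan (L1) distance
print(distance.minkowski(u, v, p=3))  # general Minkowski metric
print(distance.cosine(u, v))          # cosine *distance*, i.e. 1 - cosine similarity

# Mahalanobis needs the inverse covariance matrix of the data:
X = np.random.default_rng(0).normal(size=(100, 4))
VI = np.linalg.inv(np.cov(X.T))
print(distance.mahalanobis(u, v, VI))

# Set-based measures such as Jaccard, Dice, Russell-Rao operate on boolean vectors:
a = np.array([True, True, False, True])
b = np.array([True, False, False, True])
print(distance.jaccard(a, b), distance.dice(a, b), distance.russellrao(a, b))
```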
I have a dataset which contains continuous and categorical data.
Can anyone help with the questions below?
Can I use k-means clustering with Euclidean distance, or should I use Gower's distance?
When should I use Euclidean distance versus Gower's distance?
How can we know we have formed the correct clusters?
K-means and Euclidean distance are defined on a vector space of real numbers, so they are not defined on mixed data. Hence you can't use them; the result would not be k-means / Euclidean distance but something different.
There are plenty of alternatives if you do some research in the literature.
You never know whether you have the "correct" clusters, because there is no such thing as a "right" or "wrong" cluster. It's subjective, not objective.
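To make the Gower alternative concrete, here is a minimal, unoptimized sketch of Gower's distance for mixed data (a dedicated gower package also exists on PyPI): numeric columns contribute range-normalized absolute differences, categorical columns contribute 0/1 mismatches, and the per-feature distances are averaged.

```python
import numpy as np
import pandas as pd

def gower_distance(df: pd.DataFrame) -> np.ndarray:
    """Pairwise Gower distances over a mixed-type DataFrame."""
    parts = []
    for col in df.columns:
        x = df[col].to_numpy()
        if np.issubdtype(x.dtype, np.number):
            # Range-normalized absolute difference for numeric features.
            rng = x.max() - x.min()
            d = np.abs(x[:, None] - x[None, :]) / (rng if rng else 1.0)
        else:
            # Simple mismatch (0 if equal, 1 otherwise) for categorical features.
            d = (x[:, None] != x[None, :]).astype(float)
        parts.append(d)
    return np.mean(parts, axis=0)  # average over features

df = pd.DataFrame({"age": [25, 40, 33],
                   "income": [30_000, 90_000, 55_000],
                   "city": ["NY", "SF", "NY"]})
D = gower_distance(df)  # symmetric n x n matrix with values in [0, 1]
```

The resulting matrix D can then be fed to any clustering method that accepts precomputed distances, such as hierarchical clustering or k-medoids.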
I have a set of documents and I just want to group related docs. Currently I'm using Google's news vector file (GoogleNews-vectors-negative300.bin) to obtain word vectors, and I use the WMD (Word Mover's Distance) algorithm to get the distance between two documents. Now I want to integrate this with k-means clustering. Basically, I want to override the distance calculation function in KMeans. How can I do that? Any suggestions are most welcome. Thanks in advance.
Although it is possible in theory to implement k-means with other distance measures, it is not advised: your algorithm could stop converging. A more detailed discussion can be found, e.g., on StackExchange. That's why scikit-learn does not offer other distance metrics for k-means.
I'd suggest using e.g. hierarchical clustering, where you can plug in an arbitrary distance function.
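A sketch of that approach: precompute the pairwise WMD matrix and hand it to a clustering algorithm that accepts precomputed distances. This assumes gensim's KeyedVectors.wmdistance (which needs an optimal-transport backend installed); note that in older scikit-learn versions the `metric` argument of AgglomerativeClustering was called `affinity`.

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import AgglomerativeClustering

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Pre-tokenized documents (stand-in examples).
docs = [["machine", "learning", "model"],
        ["deep", "neural", "network"],
        ["stock", "market", "crash"]]

# Build the symmetric pairwise WMD matrix.
n = len(docs)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = kv.wmdistance(docs[i], docs[j])

# Cluster directly on the precomputed distances.
labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average").fit_predict(D)
```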
Looks like I can't use cosine similarity with sklearn's KDTree, for example, but I need it because I am measuring the similarity of word vectors. What is a fast, robust algorithm I can adapt for this case? I know about Locality-Sensitive Hashing, but it needs a lot of tuning and testing to find good parameters.
The ranking you would get with cosine similarity is equivalent to the rank order of the Euclidean distance when you normalize all the data points first. So you can use a KD tree to find the k nearest neighbors, but you will need to recompute the cosine similarity from the returned distances.
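A minimal sketch of that normalization trick with sklearn's KDTree: on L2-normalized vectors, Euclidean distance d and cosine similarity s are related by s = 1 - d**2 / 2, so a query on the normalized data returns the cosine nearest neighbors, and the similarities can be recovered afterwards.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))             # stand-in word vectors
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows

tree = KDTree(Xn)
q = Xn[:1]                       # query with the first (normalized) vector
dist, idx = tree.query(q, k=5)   # Euclidean distances on the unit sphere
cos_sim = 1.0 - dist**2 / 2.0    # recover the actual cosine similarities
```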
The cosine similarity is not a distance metric as normally presented, but it can be transformed into one. Once done, you can use other structures like ball trees to do accelerated nearest-neighbor search with cosine similarity directly. I've implemented this in the JSAT library, if you are interested in a Java implementation.
According to the table at the end of this page, cosine support with the k-d-tree should be possible: ELKI supports cosine with the R-tree, and you can derive bounding rectangles for the k-d-tree, too; the k-d-tree supports at least five metrics in that table. So I do not see why it shouldn't work.
Indexing support in sklearn is often not very complete (albeit improving), unfortunately, so don't take that as a reference.
While the k-d-tree can theoretically support cosine by
- transforming the data such that cosine becomes Euclidean distance, or
- working with the bounding boxes and the minimum angle to the bounding box (which appears to be what ELKI is doing for the R-tree),
you should be aware that the k-d-tree does not work very well with high-dimensional data, and cosine is mostly popular for very high-dimensional data. A k-d-tree only ever looks at one dimension per split. If you want all d dimensions to be used once, you need O(2^d) data points. For high d, there is no way all attributes get used.
The R-tree is slightly better here because it uses bounding boxes; these shrink with every split in all dimensions, so the pruning does get better. But this also means it needs a lot of memory for such data, and the tree construction may suffer from the same problem.
So in essence, don't use either for high dimensional data.
But also don't assume that cosine magically improves your results, in particular for high-dimensional data. It's very much overrated. As the transformation above indicates, there cannot be a systematic benefit of cosine over Euclidean: cosine is a special case of Euclidean.
For sparse data, inverted lists (cf. Lucene, Xapian, Solr, ...) are the way to index for cosine.
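To illustrate the idea, here is a toy inverted list for cosine over sparse term-weight vectors (real engines like Lucene add compression, skipping, and smarter scoring): each term maps to the documents containing it, so a query only touches documents that share at least one nonzero term. Vectors are stored L2-normalized so accumulated dot products are directly cosine similarities.

```python
from collections import defaultdict
import math

docs = {0: {"apple": 2.0, "pie": 1.0},
        1: {"apple": 1.0, "tree": 3.0},
        2: {"car": 1.0, "tree": 1.0}}

# Build postings: term -> [(doc_id, normalized weight), ...]
postings = defaultdict(list)
for doc_id, vec in docs.items():
    norm = math.sqrt(sum(w * w for w in vec.values()))
    for term, w in vec.items():
        postings[term].append((doc_id, w / norm))

def cosine_search(query: dict, k: int = 3):
    """Return the top-k documents by cosine similarity to the query vector."""
    norm = math.sqrt(sum(w * w for w in query.values()))
    scores = defaultdict(float)
    for term, w in query.items():
        for doc_id, dw in postings.get(term, []):
            scores[doc_id] += (w / norm) * dw  # accumulate dot product
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(cosine_search({"apple": 1.0, "tree": 1.0}))
```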
What is an efficient and correct metric I can use to compare two images in matrix form? I have built a machine learning model which predicts an image, and I want to see how far off it is from the target using a single number for easy comparison.
There are a lot of different methods you can use. I guess the most popular ones are:
Euclidean Distance
Chord Distance
Pearson’s Correlation Coefficient
Spearman Rank Coefficient
You can also study these and other metrics (their main advantages and drawbacks) here: Image Registration: Principles, Tools and Methods by A. Ardeshir Goshtasby.
DOI: 10.1007/978-1-4471-2458-0
Hope it helps.
Adding to the excellent start from Victor Oliveira Antonino, I suggest starting with either Pearson's correlation or cosine. The rank coefficient isn't particularly applicable to this space; Euclidean and chord distance have properties that don't represent our human interpretations of image similarity as well.
Each metric has advantages and disadvantages. When you get into an application that doesn't map readily to physical distance, then Euclidean distance is unlikely to be the best choice.
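A minimal sketch of scoring a predicted image against a target with Pearson's correlation and cosine similarity, as suggested above, treating each image matrix as one long vector (random arrays stand in for real images here):

```python
import numpy as np

target = np.random.default_rng(0).random((64, 64))
pred = target + 0.05 * np.random.default_rng(1).normal(size=(64, 64))

t, p = target.ravel(), pred.ravel()  # flatten matrices to vectors

pearson = np.corrcoef(t, p)[0, 1]    # invariant to brightness/contrast shifts
cosine = t @ p / (np.linalg.norm(t) * np.linalg.norm(p))
print(f"Pearson: {pearson:.4f}, cosine: {cosine:.4f}")  # both near 1 here
```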
I am using Mahout's recommenditembased algorithm. What are the differences between all the --similarity classes available? How do I know which is the best choice for my application? These are my choices:
SIMILARITY_COOCCURRENCE
SIMILARITY_LOGLIKELIHOOD
SIMILARITY_TANIMOTO_COEFFICIENT
SIMILARITY_CITY_BLOCK
SIMILARITY_COSINE
SIMILARITY_PEARSON_CORRELATION
SIMILARITY_EUCLIDEAN_DISTANCE
What does each one mean?
I'm not familiar with all of them, but I can help with some.
Cooccurrence is how often two items occur with the same user. http://en.wikipedia.org/wiki/Co-occurrence
Log-Likelihood is the log of the probability that the item will be recommended given the characteristics you are recommending on. http://en.wikipedia.org/wiki/Log-likelihood
Not sure about Tanimoto (see the last answer below).
City block is the distance between two instances if you assume you can only move around as in a checkerboard-style city. http://en.wikipedia.org/wiki/Taxicab_geometry
Cosine similarity is the cosine of the angle between the two feature vectors. http://en.wikipedia.org/wiki/Cosine_similarity
Pearson correlation is the covariance of the features normalized by their standard deviations. http://en.wikipedia.org/wiki/Pearson_correlation_coefficient
Euclidean distance is the standard straight line distance between two points. http://en.wikipedia.org/wiki/Euclidean_distance
To determine which is best for your application, you most likely need some intuition about your data and what it means. If your data consists of continuous-valued features, then something like Euclidean distance or Pearson correlation makes sense. If you have more discrete values, then something along the lines of city block or cosine similarity may make more sense.
Another option is to set up a cross-validation experiment where you see how well each similarity metric predicts the desired output values, and select the metric that works best from the cross-validation results.
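A minimal sketch of that cross-validation idea with scikit-learn, using a k-NN classifier on a stand-in dataset (swap in your own data, model, and candidate metrics):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # placeholder data

# Try each candidate metric and keep the one with the best CV score.
for metric in ["euclidean", "manhattan", "cosine", "chebyshev"]:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"{metric:10s} {score:.3f}")
```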
Tanimoto and Jaccard are similar: both are statistics used for comparing the similarity and diversity of sample sets.
https://en.wikipedia.org/wiki/Jaccard_index
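As a tiny illustration of the Jaccard/Tanimoto idea (intersection over union on, say, the sets of users who interacted with each item, so identical sets score 1 and disjoint sets score 0):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

users_who_liked_x = {"alice", "bob", "carol"}
users_who_liked_y = {"bob", "carol", "dave"}
print(jaccard(users_who_liked_x, users_who_liked_y))  # 2 / 4 = 0.5
```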