Tanimoto Score and when it is used - machine-learning

I read a Wikipedia article that describes the Jaccard index and explains the Tanimoto score as an extended Jaccard index, but what exactly does it try to do?
How is it different from other similarity scores?
When is it used?
Thank you

I just read the Wikipedia article too, so I can only interpret the content for you.
The Jaccard score is used for vectors that take discrete values, most often binary values (1 or 0). The Tanimoto score is used for vectors that can take on continuous values. It is designed so that if the vector only takes values of 1 and 0, it works the same as the Jaccard score.
I would imagine you would use Tanimoto's when you have a 'mixed' vector that has some continuous-valued parts and some binary-valued parts.

What exactly does it try to do?
The Tanimoto score assumes that each data object is a vector of attributes. The attributes may or may not be binary in this case. If they are all binary, the Tanimoto method reduces to the Jaccard method.
T(A,B) = A·B / (||A||^2 + ||B||^2 - A·B)
In the equation, A and B are data objects represented by vectors. The similarity score is the dot product of A and B divided by the sum of their squared magnitudes minus the dot product.
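As a concrete illustration (my own sketch, not from the original answer), here is a minimal Python version of this formula:

import numpy as np

def tanimoto(a, b):
    # Tanimoto score for real-valued vectors; reduces to the Jaccard
    # index when a and b are binary (0/1) vectors.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    dot = np.dot(a, b)
    return dot / (np.dot(a, a) + np.dot(b, b) - dot)

print(tanimoto([1, 1, 0, 1], [0, 1, 1, 1]))        # 0.5 = Jaccard (2 shared / 4 in union)
print(tanimoto([0.5, 2.0, 1.0], [0.4, 1.8, 0.9]))  # works for continuous values too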
How is it different from other similarity scores?
Tanimoto v/s Jaccard: If the attributes are binary, Tanimoto reduces to the Jaccard index.
There are various similarity scores available, but let's compare it with the most frequently used ones.
Tanimoto v/s Dice:
The Tanimoto coefficient is determined by looking at the number of attributes that are common to both data objects (the intersection of the data strings) compared to the number of attributes that are in either (the union of the data objects).
The Dice coefficient is the number of attributes common to both data objects relative to the average size of the total number of attributes present, i.e.
(A ∩ B) / 0.5 (A + B)
D(A,B) = A·B / (0.5 (||A||^2 + ||B||^2))
Tanimoto v/s Cosine:
Finding the cosine similarity between two data objects requires that both objects represent their attributes as a vector. Similarity is then measured as the cosine of the angle between the two vectors.
cos(θ) = A·B / (||A|| ||B||)
You can also refer to: When can two objects have identical Tanimoto and Cosine score.
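To see how these scores behave on the same pair of vectors, here is a small sketch (the vectors are made-up illustrative values):

import numpy as np

a = np.array([1.0, 2.0, 0.0, 3.0])
b = np.array([2.0, 2.0, 1.0, 1.0])

dot, na2, nb2 = np.dot(a, b), np.dot(a, a), np.dot(b, b)

tanimoto = dot / (na2 + nb2 - dot)       # "intersection over union" style
dice     = dot / (0.5 * (na2 + nb2))     # "intersection over average size"
cosine   = dot / np.sqrt(na2 * nb2)      # cosine of the angle between the vectors

print(tanimoto, dice, cosine)            # three different values for the same pair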
Tanimoto v/s Pearson:
The Pearson Coefficient is a complex and sophisticated approach to finding similarity. The method generates a "best fit" line between attributes in two data objects. The Pearson Coefficient is found using the following equation:
ρ(A,B) = cov(A,B) / (σ_A σ_B)
where,
cov(A,B) --> covariance of A and B
σ_A --> standard deviation of A
σ_B --> standard deviation of B
The coefficient is found from dividing the covariance by the product of the standard deviations of the attributes of two data objects. It is more robust against data that isn't normalized. For example, if one person ranked movies "a", "b", and "c" with scores of 1, 2, and 3 respectively, he would have a perfect correlation to someone who ranked the same movies with a 4, 5, and 6.
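The movie example can be checked directly (a quick sketch of mine):

import numpy as np

person_1 = np.array([1, 2, 3])   # ratings for movies "a", "b", "c"
person_2 = np.array([4, 5, 6])   # same ordering, just shifted by 3

# Pearson correlation is invariant to shifting and scaling of the ratings,
# so these two raters correlate perfectly.
print(np.corrcoef(person_1, person_2)[0, 1])   # 1.0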
For more information on the Tanimoto score v/s other similarity scores/coefficients you can refer to:
Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?
When is it used?
The Tanimoto score can be used in both situations:
When attributes are Binary
When attributes are Non-Binary
The following applications use the Tanimoto score extensively:
Chemoinformatics
Clustering
Plagiarism detection
Automatic thesaurus extraction
Visualization of high-dimensional datasets
Analysis of market-basket transactional data
Anomaly detection in spatio-temporal data

Related

Is there any reason to (not) L2-normalize vectors before using cosine similarity?

I was reading the paper "Improving Distributional Similarity
with Lessons Learned from Word Embeddings" by Levy et al., and while discussing their hyperparameters, they say:
Vector Normalization (nrm) As mentioned in Section 2, all vectors (i.e. W’s rows) are normalized to unit length (L2 normalization), rendering the dot product operation equivalent to cosine similarity.
I then recalled that the default for the sim2 vector similarity function in the R text2vec package is to L2-norm vectors first:
sim2(x, y = NULL, method = c("cosine", "jaccard"), norm = c("l2", "none"))
So I'm wondering what the motivation might be for this normalize-then-cosine combination (both in terms of text2vec and in general). I tried to read up on the L2 norm, but mostly it comes up in the context of normalizing before using the Euclidean distance. I could not find (surprisingly) anything on whether L2 normalization would be recommended for or against in the case of cosine similarity on word vector spaces/embeddings. And I don't quite have the math skills to work out the analytic differences.
So here is the question, meant in the context of word vector spaces learned from textual data (either just co-occurrence matrices, possibly weighted by tf-idf, PPMI, etc., or embeddings like GloVe), and calculating word similarity (with the goal of course being to use a vector space + metric that best reflects the real-world word similarities). Is there, in simple words, any reason to (not) use the L2 norm on a word-feature matrix/term-co-occurrence matrix before calculating the cosine similarity between the vectors/words?
If you want to get cosine similarity, you DON'T need to normalize to unit L2 norm first and then calculate cosine similarity. Cosine similarity already normalizes the vectors and then takes the dot product of the two.
If you are calculating Euclidean distance, then you NEED to normalize if distance or vector length is not an important distinguishing factor. If vector length is a distinguishing factor, then don't normalize and calculate the Euclidean distance as it is.
text2vec handles everything automatically - it will make the rows have unit L2 norm and then call the dot product to calculate cosine similarity.
But if the matrix already has rows with unit L2 norm, then the user can specify norm = "none" and sim2 will skip the first normalization step (which saves some computation).
I understand the confusion - probably I need to remove the norm option (it doesn't take much time to normalize a matrix).
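A quick numpy sketch (mine, not from text2vec) of the point that L2-normalizing first and then taking a plain dot product gives exactly the cosine similarity:

import numpy as np

X = np.random.rand(5, 300)   # hypothetical word vectors, one per row
Y = np.random.rand(4, 300)

# cosine similarity computed directly
cos = X @ Y.T / (np.linalg.norm(X, axis=1, keepdims=True) * np.linalg.norm(Y, axis=1))

# L2-normalize the rows first, then a plain dot product gives the same matrix
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
print(np.allclose(cos, Xn @ Yn.T))   # True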

Building regression using Categorical features

I am trying to use house price prediction as a practical example to learn machine learning. Currently I have run into a problem regarding neighborhood.
With most machine learning examples, I saw features such as number of bedrooms, floor space, and land area being used. Intuitively, these features have strong correlations with house prices. However, this is not the case for neighborhood. Let's say I randomly assign a neighborhood_id to each neighborhood. I won't be able to tell whether the neighborhood with id 100 has higher or lower house prices than the neighborhood with id 53.
I am wondering if I need to do some data pre-processing, such as finding the average price for each neighborhood and then using the processed data, or whether there are existing machine learning algorithms that figure out the relation from a seemingly unrelated feature.
I'm assuming that you're trying to interpret the relationship between neighborhood and housing price in a regression model with continuous and categorical data. From what I remember, R handles categorical variables automatically using one-hot encoding.
There are ways to approach this problem by creating data abstractions from categorical variables:
1) One-Hot Encoding
Let's say you're trying to predict housing prices from floor space and neighborhood. Assume that floor space is continuous and neighborhood is categorical with 3 possible neighborhoods, being A, B and C. One possibility is to encode neighborhood as a one-hot vector and treat each category as a new binary variable:
neighborhood   A  B  C
A              1  0  0
B              0  1  0
B              0  1  0
C              0  0  1
The regression model would be something like:
y = c0*bias + c1*floor_space + c2*A + c3*B + c4*C
Note that this neighborhood variable is similar to bias in regression models. The coefficient for each neighborhood can be interpreted as the "bias" of the neighborhood.
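A minimal sketch of this approach in Python (the column names and toy data are hypothetical):

import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical toy data
df = pd.DataFrame({
    "floor_space":  [70, 85, 60, 95, 72],
    "neighborhood": ["A", "B", "B", "C", "A"],
    "price":        [200, 250, 230, 310, 210],
})

# one-hot encode the categorical column, keep the continuous one as-is
X = pd.get_dummies(df[["floor_space", "neighborhood"]], columns=["neighborhood"])
y = df["price"]

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_)))   # per-neighborhood "bias" coefficients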
2) From categorical to continuous
Let's call Dx and Dy the horizontal and vertical distances from each neighborhood to a fixed point on the map. By doing this, you create a data abstraction that transforms neighborhood, a categorical variable, into two continuous variables, and you can then correlate housing prices with horizontal and vertical distance from your fixed point.
Note that this is only appropriate when the transformation from categorical to continuous makes sense.
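For example, a small sketch with made-up neighborhood coordinates and reference point:

# hypothetical map coordinates (km) for each neighborhood
coords = {"A": (2.0, 5.0), "B": (7.5, 1.0), "C": (4.0, 9.0)}
reference = (0.0, 0.0)   # the fixed point, e.g. the city centre

def to_distances(neighborhood):
    x, y = coords[neighborhood]
    # two continuous features (Dx, Dy) replacing the categorical one
    return x - reference[0], y - reference[1]

print(to_distances("B"))   # (7.5, 1.0)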

Learning to tag sentences with keywords based on examples

I have a set (~50k elements) of small text fragments (usually one or two sentences) each one tagged with a set of keywords chosen from a list of ~5k words.
How would I go about implementing a system that, learning from these examples, can then tag new sentences with the same set of keywords? I don't need code, I'm just looking for some pointers and methods/papers/possible ideas on how to implement this.
If I understood you well, what you need is a measure of similarity for a pair of documents. I have recently been using TF-IDF for clustering of documents and it worked quite well. I think you can compute TF-IDF values here and then calculate the cosine similarity between the corresponding TF-IDF vectors of the documents.
TF-IDF computation
TF-IDF stands for Term Frequency - Inverse Document Frequency. Here is a definition of how it can be calculated:
Compute TF-IDF values for all words in all documents
- TF-IDF score of a word W in document D is
TF-IDF(W, D) = TF(W, D) * IDF(W)
where TF(W, D) is frequency of word W in document D
IDF(W) = log(N/(2 + #W))
N - number of documents
#W - number of documents that contain word W
- words contained in the title will count twice (means more important)
- normalize TF-IDF values: sum of all TF-IDF(W, D)^2 in a document should be 1.
Depending on the technology you use, this may be achieved in different ways. I implemented it in Python using a nested dictionary: first I use the document name D as a key, and then for each document D I have a nested dictionary with word W as the key, where each word W has a corresponding numeric value, which is the calculated TF-IDF.
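A minimal sketch of that nested-dictionary approach (my own illustration, using the IDF variant defined above):

import math
from collections import Counter

docs = {   # hypothetical document name -> text
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat on the log",
    "doc3": "cats and dogs are pets",
    "doc4": "the mat is on the floor",
}

N = len(docs)
tokens = {d: text.split() for d, text in docs.items()}
df = Counter(w for words in tokens.values() for w in set(words))   # #W per word

tfidf = {}
for d, words in tokens.items():
    tf = Counter(words)
    # note: this IDF variant can become negative for words in most documents
    scores = {w: tf[w] * math.log(N / (2 + df[w])) for w in tf}
    # normalize so that the sum of squared TF-IDF values per document is 1
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    tfidf[d] = {w: v / norm for w, v in scores.items()}

print(tfidf["doc1"])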
Similarity computation
Let's say you have already calculated the TF-IDF values and you want to compare two documents W1 and W2 to see how similar they are. For that we need to use some similarity metric. There are many choices, each one having pros and cons. In this case, IMO, Jaccard similarity and cosine similarity would work well. Both functions would take the TF-IDF values and the names of the two documents W1 and W2 as arguments and would return a numeric value which indicates how similar the two documents are.
After computing the similarity between two documents you obtain a numeric value. The greater the value, the more similar the two documents W1 and W2 are. Now, depending on what you want to achieve, there are two scenarios.
If you want to assign a document only the tags of the most similar document, then you compare it with all other documents and assign to the new document the tags of the most similar one.
Alternatively, you can set some threshold and assign all tags of documents whose similarity with the document in question is greater than the threshold value, as sketched below. If you set threshold = 0.7, then every document W will get the tags of all already-tagged documents V for which similarity(W, V) > 0.7.
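A minimal sketch of that threshold-based variant (the helper and the tags dictionary are my own, hypothetical names); tfidf is the nested dictionary from above and tags maps an already-tagged document name to its set of keywords:

def cosine_similarity(tfidf, d1, d2):
    # cosine similarity over the sparse TF-IDF dictionaries of two documents
    common = set(tfidf[d1]) & set(tfidf[d2])
    dot = sum(tfidf[d1][w] * tfidf[d2][w] for w in common)
    n1 = sum(v * v for v in tfidf[d1].values()) ** 0.5
    n2 = sum(v * v for v in tfidf[d2].values()) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0

def assign_tags(new_doc, tfidf, tags, threshold=0.7):
    # collect the tags of every already-tagged document that is similar enough
    assigned = set()
    for other, other_tags in tags.items():
        if cosine_similarity(tfidf, new_doc, other) > threshold:
            assigned.update(other_tags)
    return assigned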
I hope it helps.
Best of luck :)
Given your description, you are looking for some form of supervised learning. There are many methods in that class, e.g. Naive Bayes classifiers, Support Vector Machines (SVM), k Nearest Neighbours (kNN) and many others.
For a numerical representation of your text, you can choose a bag-of-words or a frequency list (essentially, each text is represented by a vector in a high-dimensional vector space spanned by all the words).
BTW, it is far easier to tag the texts with one keyword (a classification task) than to assign them up to five of them (the number of possible classes explodes combinatorially).
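A minimal scikit-learn sketch of such a supervised setup (the data is made up, and Naive Bayes is just one of the options mentioned above), treating the task as multi-label classification over a bag-of-words representation:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["the stock market fell sharply", "new vaccine trial shows promise"]
keywords = [{"finance", "markets"}, {"health", "research"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(keywords)   # binary indicator matrix of keywords

# bag-of-words features + one binary Naive Bayes classifier per keyword
model = make_pipeline(CountVectorizer(), OneVsRestClassifier(MultinomialNB()))
model.fit(texts, Y)

pred = model.predict(["market research on vaccines"])
print(mlb.inverse_transform(pred))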

Why does Apache Mahout ItemSimilarity use LP-Space normalization

Why is LP-space normalization being used in Mahout's VectorNormMapper for ItemSimilarity? I have also read that a norm power of 2 works great for CosineSimilarity.
Is there an intuitive explanation of why it is being used, and how can the best value of the power be determined for a given Similarity class?
Vector norms can be defined for any L_p metric. Different norms have different properties according to which problem you are working on. Common values of p include 1 and 2 with 0 used occasionally.
Certain similarity functions in Mahout are closely related to a particular norm. Your example of the cosine similarity is a good one. The cosine similarity is computed by scaling both vector inputs to have L_2 length = 1 and then taking the dot product. This value is equal to the cosine of the angle between the vectors if the vectors are expressed in Cartesian space. This value is also 1 - d^2/2, where d is the L_2 norm of the difference between the normalized vectors.
This means that there is an intimate connection between cosine similarity and L_2 distance.
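A quick numerical check of that relationship (my sketch, not Mahout code):

import numpy as np

a, b = np.random.rand(10), np.random.rand(10)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # scale both to L_2 length 1

cos = np.dot(a, b)                     # cosine similarity
d = np.linalg.norm(a - b)              # L_2 distance between the normalized vectors
print(np.isclose(cos, 1 - d**2 / 2))   # True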
Does that answer your question?
These questions are likely to get answered more quickly on the Apache Mahout mailing lists, btw.

Libsvm: SVM normalizing starts from 0 or 0.001

I am using libsvm for my document classification.
I use svm.h and svm.cc only in my project.
Its struct svm_problem requires an array of svm_node entries that are non-zero, thus using a sparse representation.
I get a vector of tf-idf values, let's say in the range [5, 10]. If I normalize it to [0, 1], all the 5's would become 0.
Should I remove these zeroes when sending it to svm_train?
Wouldn't removing them reduce the information and lead to poor results?
Should I start the normalization from 0.001 rather than 0?
Well, in general, doesn't normalizing to [0, 1] reduce information for an SVM?
An SVM is not Naive Bayes; feature values are not counters but dimensions in a multidimensional real-valued space, so 0's carry exactly the same amount of information as 1's (which also answers your concern regarding removing 0 values - don't do it). There is no reason to ever normalize data to [0.001, 1] for an SVM.
The only issue here is that column-wise normalization is not a good idea for tf-idf, as it will degenerate your features to plain tf (for a particular i-th dimension, tf-idf is simply the tf value in [0, 1] multiplied by a constant idf, so the normalization will multiply by idf^-1). I would consider one of these alternative preprocessing methods (see the sketch after the list):
normalizing each dimension, so it has mean 0 and variance 1
decorrelation by transforming x = C^(-1/2) * x, where C is the data covariance matrix
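A minimal numpy sketch of both options (my illustration; the data is hypothetical and the whitening here is applied to the already-standardized features):

import numpy as np

X = np.random.rand(100, 5) * [5, 1, 10, 2, 3]     # hypothetical tf-idf-like features

# option 1: standardize each dimension to mean 0, variance 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# option 2: decorrelate (whiten) with the inverse square root of the covariance
C = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
C_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
X_white = X_std @ C_inv_sqrt

print(np.cov(X_white, rowvar=False).round(2))     # approximately the identity matrix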
