We're trying to find similarity between items (and later users) where the items are ranked in various lists by users (think Rob, Barry and Dick in High Fidelity). A lower index in a given list implies a higher rating.
I suppose a standard approach would be to use the Pearson correlation and then invert the indexes in some way.
However, as I understand it, the aim of the Pearson correlation is to compensate for differences between users who typically rate things higher or lower but have similar relative ratings.
It seems to me that if the lists are continuous (although of arbitrary length), the ratings implied from position won't be skewed in this way, so this isn't an issue.
I suppose in this case a Euclidean-based similarity would suffice. Is this the case? Would using the Pearson correlation have a negative effect and find correlations that aren't appropriate? What similarity measure might best suit this data?
Additionally, while we want position in the list to have an effect, we don't want to over-penalise rankings that are far apart. Two users who both feature an item in their lists, even at very different ranks, should still be considered similar.
Jaccard similarity looks like a better fit in your case. To include the rank you mentioned, you can take a bag-of-items (multiset) approach.
Using your example of (Rob, Barry, Dick) with their ratings being (3, 2, 1) respectively, you insert Rob 3 times into user a's bag:
Rob, Rob, Rob.
Then for Barry, you do it twice. The current bag now looks like this:
Rob, Rob, Rob, Barry, Barry.
Finally, you put Dick into the bag once:
Rob, Rob, Rob, Barry, Barry, Dick
Suppose another user b has a bag of [Dick, Dick, Barry]. You calculate the Jaccard similarity as below:
The intersection between a and b = [Dick, Barry]
The union of a and b = [Rob, Rob, Rob, Barry, Barry, Dick, Dick]
The Jaccard Similarity = 2/7,
that is, the number of items in the intersection divided by the number of items in the union.
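To make the bag arithmetic concrete, here is a minimal Python sketch of the same computation using collections.Counter (the names and counts are just the example values above):

```python
from collections import Counter

def bag_jaccard(bag_a, bag_b):
    """Jaccard similarity over bags (multisets): |intersection| / |union|."""
    a, b = Counter(bag_a), Counter(bag_b)
    intersection = sum((a & b).values())  # per-item minimum counts
    union = sum((a | b).values())         # per-item maximum counts
    return intersection / union if union else 0.0

# User a ranked Rob > Barry > Dick; user b ranked Dick > Barry.
bag_a = ["Rob"] * 3 + ["Barry"] * 2 + ["Dick"]
bag_b = ["Dick"] * 2 + ["Barry"]
print(bag_jaccard(bag_a, bag_b))  # 2/7 ≈ 0.286
```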
This similarity measure does NOT heavily penalize rankings that are far apart, which satisfies your requirement: two users who both feature an item in their lists, even with very different rankings, are still considered similar.
The most well-known similarity metric based only on ranking is Spearman's correlation. It just assigns "1" to the first item, "2" to the second and so on and computes a (Pearson) correlation coefficient. (You can make the values descending too, which is more intuitive -- won't matter to Pearson's correlation.)
Spearman's correlation is implemented in the project but, that said, I do not think it is very useful.
Kendall's tau rank correlation is a more principled measure of how much ranked lists match, but it's not implemented. It would not be hard to add.
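For reference, both rank correlations are available in SciPy if you want to experiment outside the project; a minimal sketch with made-up positions for three items shared by two users:

```python
from scipy.stats import spearmanr, kendalltau

# Hypothetical example: positions of the same three items (A, B, C)
# in two users' lists, where 1 means the top of the list.
ranks_user1 = [1, 2, 3]
ranks_user2 = [2, 1, 3]

rho, _ = spearmanr(ranks_user1, ranks_user2)   # Pearson's r over the ranks
tau, _ = kendalltau(ranks_user1, ranks_user2)  # concordant vs. discordant pairs
print(rho, tau)  # 0.5, 0.333...
```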
Related
I'm looking for a similarity measure (like the Jaccard Index) but I want to use known similarities between objects within the set, and weigh the connections by the item abundances. These known similarities are scores between 0 and 1, 1 indicating an exact match.
For example, consider two sets:
SET1 {A,B,C} and SET2 {A',B',C'}
I know that
{A,A'}, {B,B'}, {C,C'} each have an item similarity of 0.9. Hence, I would expect the similarity of SET1 and SET2 to be relatively high.
Another example would be: consider two sets SET1 {A,B,C} and SET2 {A,B',C',D,E,F,.....,Z}. Although the matches between the first three items are higher than in the first example, this score should likely be lower because of the size difference (as in Jaccard).
One more issue here is how to use abundances as weights, but I've got no idea as to how to solve this.
In general, I need a normalized set similarity measure that takes into account this item similarity and abundance.
Correct me if I'm wrong, but I guess you need clustering error as a similarity measure. It is the proportion of points which are clustered differently in A' and A after an optimal matching of clusters; in other words, it is the scaled sum of the non-diagonal elements of the confusion matrix, minimized over all possible permutations of rows and columns. The Hungarian algorithm is used to avoid the high computational cost of checking every permutation, and the measure penalizes a differing number of elements in the sets.
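A minimal sketch of that idea, assuming SciPy's linear_sum_assignment for the Hungarian step (the function name and the example label vectors are mine, just to illustrate the computation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(labels_a, labels_b):
    """Fraction of points placed in different clusters after optimally
    matching cluster labels between the two assignments."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    clusters_a, clusters_b = np.unique(labels_a), np.unique(labels_b)
    # Confusion matrix: how many points each pair of clusters shares.
    confusion = np.array([[np.sum((labels_a == ca) & (labels_b == cb))
                           for cb in clusters_b] for ca in clusters_a])
    # Maximizing matched (diagonal) points == minimizing mismatches.
    rows, cols = linear_sum_assignment(-confusion)
    matched = confusion[rows, cols].sum()
    return 1.0 - matched / len(labels_a)

# Two clusterings of six points; one of the six points ends up mismatched.
print(clustering_error([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 0, 2]))  # ≈ 0.167
```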
Most of the recommendation algorithms in Mahout require user-item preferences, but I want to find similar items for a given item. My system doesn't have user inputs; i.e. for any movie, these are the attributes which can be used to find a similarity coefficient:
Genre
Director
Actor
The attribute list can be modified in the future to build a more efficient system. But to find item similarity in the Mahout data model, a user preference for each item is required, whereas these movies could simply be clustered together and the closest items in the cluster returned for a given item.
Later on, after introducing user-based recommendation, the above result can be used to boost the results.
If a product attribute has some fixed values, like Genre, do I have to convert those values to numerical values? If yes, how will the system calculate the distance between two items when genre-1 and genre-2 don't have any numeric relation?
Edit:
I have found a few examples using the command line, but I want to do it in Java and save the pre-computed values for later use.
I think in the case of feature vectors, the best similarity measures are ones based on exact matches, like Jaccard similarity for example.
In Jaccard, the similarity between two item vectors is calculated as:
the number of features in the intersection divided by the number of features in the union.
So converting the genre to a numerical value will not make a difference, since the exact match (which is used to find the intersection) works the same on non-numerical values.
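As an illustration of that point (sketched in Python for brevity, even though the question asks about Java/Mahout, and with hypothetical movie attributes), the set arithmetic works directly on the raw categorical values:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of attribute values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical movies described by genre, director, and actors.
movie1 = {"Comedy", "Stephen Frears", "John Cusack", "Jack Black"}
movie2 = {"Comedy", "Richard Linklater", "Jack Black"}
print(jaccard(movie1, movie2))  # 2 shared / 5 total = 0.4
```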
Take a look at this question for how to do it in Mahout:
Does Mahout provide a way to determine similarity between content (for content-based recommendations)?
It sounds like Mahout's spark-rowsimilarity algorithm, available since version 0.10.0, would be the perfect solution to your problem. It compares the rows of a given matrix (i.e. row vectors representing movies and their properties), looking for cooccurrences of values across those rows - or in your case, cooccurrences of Genres, Directors, and Actors. No user history or item interaction is needed. The end result is another matrix mapping each of your movies to the top n most similar other movies in your collection, based on cooccurrence of genre, director, or actor.
The Apache Mahout site has a great write-up regarding how to do this from the command line, but if you want a deeper understanding of what's going on under the covers, read Pat Ferrel's machine learning blog Occam's Machete. He calls this type of similarity content or metadata similarity.
My dataset is composed of millions of rows and a couple (tens) of features.
One feature is a label with 1000 different values (imagine each row is a user and this feature is the user's first name):
Firstname,Feature1,Feature2,....
Quentin,1,2
Marc,0,2
Gaby,1,0
Quentin,1,0
What would be the best representation for this feature (to perform clustering)?
I could convert the data to integers using a LabelEncoder, but it doesn't make sense here since there is no logical "order" between two different labels:
Firstname,F1,F2,....
0,1,2
1,0,2
2,1,0
0,1,0
I could split the feature into 1000 features (one for each label), with 1 when the label matches and 0 otherwise. However, this would result in a very big matrix (too big if I can't use a sparse matrix in my classifier):
Quentin,Marc,Gaby,F1,F2,....
1,0,0,1,2
0,1,0,0,2
0,0,1,1,0
1,0,0,1,0
I could represent the LabelEncoder value in binary across N columns; this would reduce the dimension of the final matrix compared to the previous idea, but I'm not sure of the result:
LabelEncoder(Quentin) = 0 = 0,0
LabelEncoder(Marc) = 1 = 0,1
LabelEncoder(Gaby) = 2 = 1,0
A,B,F1,F2,....
0,0,1,2
0,1,0,2
1,0,1,0
0,0,1,0
... Any other ideas?
What do you think about solution 3?
Edit, for some extra explanation:
I should have mentioned this in my first post, but in the real dataset the feature is more like the final leaf of a classification tree (Aa1, Aa2, etc. in the example below; it's not a binary tree).
A B C
Aa Ab Ba Bb Ca Cb
Aa1 Aa2 Ab1 Ab2 Ab3 Ba1 Ba2 Bb1 Bb2 Ca1 Ca2 Cb1 Cb2
So there is a similarity between two terms under the same parent (Aa1, Aa2 and Aa3 are quite similar, and Aa1 is as different from Ba1 as it is from Cb2).
The final goal is to find similar entities from a smaller dataset: we train a OneClassSVM on the smaller dataset and then get a distance for each entity of the entire dataset.
This problem is largely one of one-hot encoding. How do we represent multiple categorical values in a way that lets us use clustering algorithms without screwing up the distance calculation the algorithm needs to do (you could be using some sort of probabilistic finite mixture model, but I digress)? As user3914041's answer says, there really is no definitive answer, but I'll go through each solution you presented and give my impression:
Solution 1
If you're converting the categorical column to a numerical one like you mentioned, then you face the pretty big issue you noted: you basically lose the meaning of that column. What does it really even mean if Quentin is 0, Marc 1, and Gaby 2? At that point, why even include that column in the clustering? As in user3914041's answer, this is the easiest way to change your categorical values into numerical ones, but the values just aren't useful and could perhaps be detrimental to the results of the clustering.
Solution 2
In my opinion, depending upon how you implement all of this and your goals for the clustering, this would be your best bet. Since I'm assuming you plan to use sklearn and something like k-means, you should be able to use sparse matrices fine. However, as imaluengo suggests, you should consider using a different distance metric. What you can consider doing is scaling all of your numeric features to the same range as the categorical features, and then using something like cosine distance, or a mix of distance metrics, as I mention below. All in all, this will likely be the most useful representation of your categorical data for your clustering algorithm.
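A minimal sklearn sketch of that suggestion on the toy data from the question; scaling the numeric columns to [0, 1] and stacking them with the sparse one-hot columns is one reasonable choice on my part, not the only way to do it:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.metrics.pairwise import cosine_distances
from sklearn.preprocessing import OneHotEncoder

# Toy version of the question's data: a name column plus two numeric features.
names = np.array([["Quentin"], ["Marc"], ["Gaby"], ["Quentin"]])
numeric = np.array([[1, 2], [0, 2], [1, 0], [1, 0]], dtype=float)

# One-hot encode the names; the result is a sparse matrix, so even
# 1000 distinct names stay cheap to store.
name_onehot = OneHotEncoder().fit_transform(names)  # shape (4, 3), sparse

# Scale the numeric columns to [0, 1] so they share the range of the
# one-hot columns, then stack and compare rows with cosine distance.
numeric_scaled = (numeric - numeric.min(axis=0)) / np.ptp(numeric, axis=0)
features = hstack([name_onehot, csr_matrix(numeric_scaled)])
print(cosine_distances(features))  # 4x4 matrix of pairwise row distances
```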
Solution 3
I agree with user3914041 in that this is not useful, and introduces some of the same problems as mentioned with #1 -- you lose meaning when two (probably) totally different names share a column value.
Solution 4
An additional solution is to follow the advice of the answer here: you can consider rolling your own version of a k-means-like algorithm that takes a mix of distance metrics (Hamming distance for the one-hot encoded categorical data, and Euclidean for the rest). There seems to be some work on developing k-means-like algorithms for mixed categorical and numerical data, like here.
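A rough sketch of what such a mixed metric could look like (the additive weighting via alpha is my assumption; k-prototypes-style algorithms define the trade-off more carefully):

```python
import numpy as np

def mixed_distance(x, y, cat_idx, num_idx, alpha=1.0):
    """Distance mixing Hamming (categorical columns) and Euclidean
    (numeric columns); alpha weights the categorical part."""
    x, y = np.asarray(x, dtype=object), np.asarray(y, dtype=object)
    hamming = np.mean([x[i] != y[i] for i in cat_idx]) if cat_idx else 0.0
    euclid = np.sqrt(sum((float(x[i]) - float(y[i])) ** 2 for i in num_idx))
    return alpha * hamming + euclid

# Hypothetical rows: (Firstname, F1, F2).
a = ("Quentin", 1, 2)
b = ("Marc", 0, 2)
print(mixed_distance(a, b, cat_idx=[0], num_idx=[1, 2]))  # 1.0 + 1.0 = 2.0
```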
I guess it's also important to consider whether or not you need to cluster on this categorical data. What are you hoping to see?
Solution 3:
I'd say it has the same kind of drawback as using a 1..N encoding (solution 1), in a less obvious fashion. You'll have names that both give a 1 in some column, for no other reason than the order of the encoding...
So I'd recommend against this.
Solution 1:
The 1..N solution is the "easy way" to solve the format issue; as you noted, it's probably not the best.
Solution 2:
This looks like the best way to do it, but it is a bit cumbersome, and in my experience the classifier does not always perform very well with a high number of categories.
Solution 4+:
I think the encoding depends on what you want: if you think that names that are similar (like John and Johnny) should be close, you could use character n-grams to represent them. I doubt this is the case in your application though.
Another approach is to encode the name with its frequency in the (training) dataset. In this way what you're saying is: "Mainstream people should be close, whether they're Sophia or Jackson does not matter".
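For example, frequency encoding is a one-liner with pandas (toy data from the question; the extra column name is mine):

```python
import pandas as pd

# Replace each first name by how often it appears in the training set.
train = pd.DataFrame({"Firstname": ["Quentin", "Marc", "Gaby", "Quentin"],
                      "F1": [1, 0, 1, 1], "F2": [2, 2, 0, 0]})
freq = train["Firstname"].value_counts(normalize=True)
train["Firstname_freq"] = train["Firstname"].map(freq)
print(train[["Firstname", "Firstname_freq"]])
# Quentin -> 0.5, Marc -> 0.25, Gaby -> 0.25
```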
Hope the suggestions help; there's no definitive answer to this, so I'm looking forward to seeing what other people do.
I have user profiles with the following attributes.
U={age,sex,country,race}
What is the best way to find similarity between two users?
For example, I have the following 2 users:
u1={25,M,USA,White}
u2={30,M,UK,black}
I have searched and found cosine similarity mentioned a lot. Is it a good fit for my problem, or are there other suggestions?
Similarity measures between objects in cluster analysis are a broad subject.
What I would suggest is a 'divide and conquer' approach: treat the similarity between two user profiles as a weighted average of all the attribute similarities. Just remember to use normalized values for your attribute similarities before averaging. The weights for the average should be decided from the data and the use case: if you consider one of the dimensions more important when it matches between two profiles, it should have more weight in the overall result.
For the attribute distances you can try: age -> simple Euclidean; sex, race, country -> 0/1. If you have time, the distance between two countries can be defined better, based on geolocation or cultural similarity (e.g. language, religion, political system, GDP, ...). But experimenting with the weights for the final average and analysing your clustering results will probably give you more payoff ;-)
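A minimal sketch of that weighted average in Python, using the two example profiles from the question; the age scaling constant and the weights are placeholders to tune on your data:

```python
def profile_similarity(u1, u2, weights, max_age_gap=50.0):
    """Weighted average of per-attribute similarities, each in [0, 1].
    The age normalization and the weights are assumptions to tune."""
    sims = {
        "age": max(0.0, 1.0 - abs(u1["age"] - u2["age"]) / max_age_gap),
        "sex": 1.0 if u1["sex"] == u2["sex"] else 0.0,
        "country": 1.0 if u1["country"] == u2["country"] else 0.0,
        "race": 1.0 if u1["race"] == u2["race"] else 0.0,
    }
    return sum(weights[k] * sims[k] for k in sims) / sum(weights.values())

u1 = {"age": 25, "sex": "M", "country": "USA", "race": "White"}
u2 = {"age": 30, "sex": "M", "country": "UK", "race": "Black"}
weights = {"age": 1.0, "sex": 1.0, "country": 2.0, "race": 1.0}
print(profile_similarity(u1, u2, weights))  # 0.38
```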
For my class project, I am working on the Kaggle competition - Don't get kicked
The project is to classify test data as good/bad buy for cars. There are 34 features and the data is highly skewed. I made the following choices:
The data is highly skewed: out of 73,000 instances, 64,000 are bad buys and only 9,000 are good buys. Since building a decision tree would overfit the data, I chose to use kNN (k nearest neighbors).
After trying out kNN, I plan to try out Perceptron and SVM techniques, if kNN doesn't yield good results. Is my understanding about overfitting correct?
Since some features are numeric, I can directly use the Euclidean distance as a measure, but other attributes are categorical. To use these features properly, I need to come up with my own distance measure. I read about Hamming distance, but I am still unclear on how to merge two distance measures so that each feature gets equal weight.
Is there a way to find a good approximate value of k? I understand that this depends a lot on the use-case and varies per problem. But, if I am taking a simple vote from each neighbor, what should I set the value of k to? I'm currently trying out various values, such as 2, 3, 10, etc.
I looked around and found these links, but they were not specifically helpful:
a) Metric for nearest neighbor, which says that coming up with your own distance measure is equivalent to 'kernelizing', but I couldn't make much sense of it.
b) Distance independent approximation of kNN talks about R-trees, M-trees etc. which I believe don't apply to my case.
c) Finding nearest neighbors using Jaccard coeff
Please let me know if you need more information.
Since the data is unbalanced, you should either sample an equal number of good/bad (losing lots of "bad" records), or use an algorithm that can account for this. I think there's an SVM implementation in RapidMiner that does this.
You should use Cross-Validation to avoid overfitting. You might be using the term overfitting incorrectly here though.
You should normalize the distances so that each feature gets the same weight. By normalize I mean force them to be between 0 and 1: subtract the minimum and divide by the range.
The way to find the optimal value of K is to try all possible values of K (while cross-validating) and choose the one with the highest accuracy. If a "good" value of K is fine, you can use a genetic algorithm or similar to find it. Or you could try K in steps of, say, 5 or 10, see which K leads to good accuracy (say it's 55), then try steps of 1 near that "good value" (i.e. 50, 51, 52, ...), but this may not be optimal.
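A sketch of that search with scikit-learn, where X and y stand for your encoded features and labels; the k grid, the min-max scaling step, and the balanced-accuracy scoring (useful given the class imbalance) are choices to adapt:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler does the "subtract the minimum, divide by the range" step;
# GridSearchCV cross-validates each candidate k and keeps the best one.
pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
grid = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 52, 5))},
    cv=5,
    scoring="balanced_accuracy",
)
# grid.fit(X, y)                              # X, y: your features and labels
# print(grid.best_params_, grid.best_score_)  # best k and its CV score
```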
I'm looking at the exact same problem.
Regarding the choice of k, an odd value is recommended (for binary classification) to avoid "tie votes".
I hope to expand this answer in the future.