I have user profiles with the following attributes.
U={age,sex,country,race}
What is the best way to find similarity between two users?
For example, I have the following two users:
u1={25,M,USA,White}
u2={30,M,UK,black}
I have searched and found that cosine similarity is mentioned a lot. Is it a good fit for my problem, or are there other suggestions?
Similarity measures between objects in cluster analysis are a broad subject.
What I would suggest is a 'divide and conquer' approach: treat the similarity between two user profiles as a weighted average of the per-attribute similarities. Just remember to normalize the attribute similarities before averaging. The weights should be decided from the data and the use case: if a match on one dimension is more important, it should carry more weight in the overall result.
For the attribute similarities you can try: age -> simple Euclidean distance; sex, race, country -> 0/1 match. If you have time, the distance between two countries can be defined more meaningfully based on geolocation or cultural similarity (e.g. language, religion, political system, GDP, ...). But experimenting with the weights for the final average and analysing the resulting clusters will probably give you more payoff ;-)
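As a minimal sketch of this weighted-average idea (the attribute similarity functions, the 0-1 scaling of age distance, and the weights below are all illustrative assumptions, not values derived from data):

```python
# Hypothetical per-attribute similarity functions, each returning a
# normalized value in [0, 1]; tune the scale and weights on your data.
def age_sim(a1, a2, scale=50.0):
    # absolute age difference mapped to [0, 1]
    return max(0.0, 1.0 - abs(a1 - a2) / scale)

def exact_sim(v1, v2):
    # 0/1 match for categorical attributes (sex, country, race);
    # lowercasing is just to tolerate inconsistent capitalization
    return 1.0 if v1.lower() == v2.lower() else 0.0

WEIGHTS = {"age": 0.3, "sex": 0.2, "country": 0.3, "race": 0.2}

def profile_similarity(u1, u2):
    sims = {
        "age": age_sim(u1["age"], u2["age"]),
        "sex": exact_sim(u1["sex"], u2["sex"]),
        "country": exact_sim(u1["country"], u2["country"]),
        "race": exact_sim(u1["race"], u2["race"]),
    }
    # weighted average of the normalized attribute similarities
    return sum(WEIGHTS[k] * sims[k] for k in sims) / sum(WEIGHTS.values())

u1 = {"age": 25, "sex": "M", "country": "USA", "race": "White"}
u2 = {"age": 30, "sex": "M", "country": "UK", "race": "black"}
print(profile_similarity(u1, u2))   # 0.3*0.9 + 0.2*1 + 0 + 0 = 0.47
```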
I have a boolean/binary dataset where a customer id and product id appear together when the customer actually bought the product, and do not appear together when the customer did not buy it. The dataset is represented like this:
Dataset
I have tried different approaches, such as GenericBooleanPrefUserBasedRecommender with TanimotoCoefficient or LogLikelihood similarities, but I have also tried GenericUserBasedRecommender with the Uncentered Cosine Similarity, and it gave me the highest precision and recall: 100% and 60% respectively.
I am not sure whether it makes sense to use the Uncentered Cosine Similarity in this situation, or whether this logic is wrong. What does the Uncentered Cosine Similarity do with such a dataset?
Any ideas would be really appreciated.
Thank you.
100% precision is essentially impossible, so something is wrong. All the similarity metrics work fine with boolean data. Remember that the space is of very high dimensionality.
Your sample data only has two items (BTW, ids should be 0-based for the old Hadoop version of Mahout), so the dataset as shown is not going to give valid precision scores.
I've done this with large e-commerce datasets, and log-likelihood considerably outperforms the other metrics on boolean data.
BTW, Mahout has moved on from Hadoop to Spark, and our only metric is LLR. A full Universal Recommender with an event store and prediction server, based on Mahout-Samsara, is implemented here:
http://templates.prediction.io/PredictionIO/template-scala-parallel-universal-recommendation
Slides describing it here: http://www.slideshare.net/pferrel/unified-recommender-39986309
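For reference, the log-likelihood ratio (G²) statistic that LLR similarity is based on can be computed from a 2x2 co-occurrence table. This is a plain-Python sketch of Dunning's statistic, not Mahout's actual implementation:

```python
import math

def xlogx(x):
    # x * ln(x), with the conventional 0 * ln(0) = 0
    return x * math.log(x) if x > 0 else 0.0

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.
    k11: users with both items; k12/k21: users with only one;
    k22: users with neither. Near 0 means no association."""
    n = k11 + k12 + k21 + k22
    rows = xlogx(k11 + k12) + xlogx(k21 + k22)
    cols = xlogx(k11 + k21) + xlogx(k12 + k22)
    cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    return 2.0 * (cells - rows - cols + xlogx(n))

print(llr(10, 10, 10, 10))   # independent counts -> ~0
print(llr(100, 1, 1, 100))   # strong association -> large score
```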
I am using the Mahout recommenditembased algorithm. What are the differences between the available --similarity classes? How do I know what the best choice for my application is? These are my choices:
SIMILARITY_COOCCURRENCE
SIMILARITY_LOGLIKELIHOOD
SIMILARITY_TANIMOTO_COEFFICIENT
SIMILARITY_CITY_BLOCK
SIMILARITY_COSINE
SIMILARITY_PEARSON_CORRELATION
SIMILARITY_EUCLIDEAN_DISTANCE
What does each one mean?
I'm not familiar with all of them, but I can help with some.
Co-occurrence is how often two items occur with the same user. http://en.wikipedia.org/wiki/Co-occurrence
Log-likelihood similarity is based on a likelihood-ratio test: it scores how unlikely it is that the overlap between two items' users happened by chance. http://en.wikipedia.org/wiki/Log-likelihood
I am not sure about Tanimoto.
City block is the distance between two instances if you assume you can only move around as if on a checkerboard-style city grid. http://en.wikipedia.org/wiki/Taxicab_geometry
Cosine similarity is the cosine of the angle between the two feature vectors. http://en.wikipedia.org/wiki/Cosine_similarity
Pearson correlation is the covariance of the features normalized by their standard deviations. http://en.wikipedia.org/wiki/Pearson_correlation_coefficient
Euclidean distance is the standard straight line distance between two points. http://en.wikipedia.org/wiki/Euclidean_distance
To determine which is the best for your application, you most likely need some intuition about your data and what it means. If your features are continuous-valued, then something like Euclidean distance or Pearson correlation makes sense. If you have more discrete values, then something along the lines of city block or cosine similarity may make more sense.
Another option is to set up a cross-validation experiment where you see how well each similarity metric works to predict the desired output values and select the metric that works the best from the cross-validation results.
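As a rough illustration of the vector-based metrics above (plain-Python sketches, not the Mahout implementations):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    # cosine of the angle between the two feature vectors
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    # straight-line distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    # taxicab / Manhattan distance
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    # covariance normalized by the standard deviations
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    return dot(da, db) / (norm(da) * norm(db))

print(cosine([1, 2, 3], [2, 4, 6]))     # same direction -> ~1.0
print(euclidean([0, 0], [3, 4]))        # 5.0
print(city_block([0, 0], [1, 2]))       # 3
print(pearson([1, 2, 3], [6, 4, 2]))    # perfectly inverse -> ~-1.0
```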
Tanimoto and Jaccard are similar: the Jaccard index is a statistic used for comparing the similarity and diversity of sample sets.
https://en.wikipedia.org/wiki/Jaccard_index
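A minimal sketch of the Jaccard index on two users' item sets (the item names are made up):

```python
def jaccard(a, b):
    """Jaccard index of two sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# items two hypothetical users bought
bought_a = {"item1", "item2", "item3"}
bought_b = {"item2", "item3", "item4"}
print(jaccard(bought_a, bought_b))   # 2 shared / 4 total = 0.5
```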
I am trying to self-learn ML and came across this problem. Help from more experienced people in the field would be much appreciated!
Suppose I have three vectors with areas of house compartments such as bathroom, living room and kitchen. The data consist of about 70,000 houses. A histogram of each individual vector shows clear evidence of a bimodal distribution, say a two-component Gaussian mixture. I now want some sort of ML algorithm, preferably unsupervised, that classifies houses according to these attributes, e.g. large bathroom, small kitchen, large living room.
More specifically, I would like an algorithm to choose the best possible separation threshold for each bimodal vector, say large/small kitchen (this can be binary since we assume evidence of bimodality), do the same for the others, and cluster the data. Ideally this would come with some confidence measure so that I could check houses in the intermediate regimes: for instance, a house with clearly a large kitchen, but whose bathroom area falls close to the large/small threshold, would be put, say, at the bottom of a list of houses with "large kitchens and large bathrooms". For this reason, first deciding on a threshold (fitting the Gaussians with the lowest possible false-discovery rate), collapsing the data and then clustering would not be desirable.
Any advice on how to proceed? I know R and python.
Many thanks!!
What you're looking for is a clustering method: this is basically unsupervised classification. A simple method is k-means, which has many implementations (k-means can be viewed as the limit of a multivariate Gaussian mixture as the variance tends to zero). It would naturally give you a confidence measure related to the (Euclidean) distance between the point in question and the centroids.
One final note: I am not sure about clustering each attribute in turn and then making composites from the independent attributes: why not let the algorithm find the clusters in multi-dimensional space? Depending on the choice of algorithm, this will take covariance between features into account (a big kitchen increases the probability of a big bedroom) and produce natural groupings you might not consider in isolation.
Sounds like you want EM clustering with a mixture of Gaussians model.
It should be in the mclust package in R.
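If you'd rather stay in Python, here is a bare-bones EM fit of a two-component 1-D Gaussian mixture; the component posteriors give exactly the kind of confidence measure asked about. This is an illustrative sketch (in practice, use mclust in R or scikit-learn's GaussianMixture), and the simulated areas are made-up data:

```python
import math
import random

def normal_pdf(x, mu, sd):
    return math.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def em_gmm_1d(xs, iters=100):
    """Fit a two-component 1-D Gaussian mixture with EM.
    Returns component weights, means and standard deviations."""
    mu = [min(xs), max(xs)]            # crude but effective start
    sd = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [w[k] * normal_pdf(x, mu[k], sd[k]) for k in (0, 1)]
            s = p[0] + p[1]
            resp.append((p[0] / s, p[1] / s))
        # M-step: re-estimate weights, means and variances
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            sd[k] = max(1e-6, math.sqrt(
                sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk))
    return w, mu, sd

# simulated "kitchen areas": small mode near 5, large mode near 15
random.seed(0)
areas = [random.gauss(5, 1) for _ in range(300)] + \
        [random.gauss(15, 2) for _ in range(300)]
w, mu, sd = em_gmm_1d(areas)
print(mu)   # recovered component means, roughly 5 and 15
```

The responsibility computed in the E-step is the posterior P(component | x); houses whose posterior is near 0.5 are the borderline cases you would push to the bottom of a list.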
In addition to what the others have suggested, it is indeed possible to cluster (maybe even density-based clustering methods such as DBSCAN) on the individual dimensions, forming one-dimensional clusters (intervals) and working from there, possibly combining them into multi-dimensional, rectangular-shaped clusters.
I am doing a project involving exactly this. It turns out there are a few advantages to running density-based methods in one dimension, including the fact that you can do what you are saying about classifying objects on the border of one attribute according to their other attributes.
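A one-dimensional density-based pass can be sketched very simply: sort the values and split wherever the gap exceeds a radius eps. This is a simplified, DBSCAN-flavoured heuristic for one dimension, not the full algorithm:

```python
def cluster_1d(values, eps, min_pts=1):
    """Group 1-D values into clusters (intervals): a gap larger
    than eps between consecutive sorted values starts a new
    cluster; clusters smaller than min_pts are discarded."""
    vs = sorted(values)
    clusters, current = [], [vs[0]]
    for v in vs[1:]:
        if v - current[-1] <= eps:
            current.append(v)
        else:
            clusters.append(current)
            current = [v]
    clusters.append(current)
    return [c for c in clusters if len(c) >= min_pts]

print(cluster_1d([1, 2, 3, 10, 11], eps=2))   # [[1, 2, 3], [10, 11]]
```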
I'm trying to determine the similarity between pairs of items taken among a large collection. The items have several attributes and I'm able to calculate a discrete similarity score for each attribute, between 0 and 1. I use various classifiers depending on the attribute: TF-IDF cosine similarity, Naive Bayes Classifier, etc.
I'm stuck when it comes to compiling all that information into a final similarity score for the items. I can't just take an unweighted average because 1) what counts as a high score depends on the classifier, and 2) some classifiers are more important than others. In addition, some classifiers should be considered only for their high scores, i.e. a high score points to higher similarity but lower scores carry no meaning.
So far I've calculated the final score with guesswork but the increasing number of classifiers makes this a very poor solution. What techniques are there to determine an optimal formula that will take my various scores and return just one? It's important to note that the system does receive human feedback, which is how some of the classifiers work to begin with.
Ultimately I'm only interested in ranking, for each item, the ones that are most similar. The absolute scores themselves are meaningless, only their ordering is important.
There is a great book on the topic of ensemble classifiers, available online: Combining Pattern Classifiers.
Two chapters in this book (ch. 4 and 5) cover fusion of label outputs and how to obtain a single decision value.
A set of methods are defined in the chapter including:
1- Weighted Majority Vote
2- Naive Bayes Combination
3- ...
I hope that this is what you were looking for.
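For illustration, method 1 (weighted majority vote) can be sketched as follows; the labels and weights are arbitrary examples:

```python
def weighted_majority_vote(votes, weights):
    """votes: predicted label from each classifier;
    weights: matching classifier weights.
    Returns the label with the largest total weight."""
    tally = {}
    for label, w in zip(votes, weights):
        tally[label] = tally.get(label, 0.0) + w
    return max(tally, key=tally.get)

# one strong classifier can outvote two weak ones
print(weighted_majority_vote(["a", "b", "b"], [0.6, 0.3, 0.2]))   # a
```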
Get a book on ensemble classification; there has been a lot of work on how to learn a good combination of classifiers, and there are numerous choices. You can learn weights and take a weighted average, or use error-correcting output codes, etc.
Anyway, read up on "ensemble classification", that is the keyword you need.
We're trying to find the similarity between items (and later users) where the items are ranked in various lists by users (think Rob, Barry and Dick in High Fidelity). A lower index in a given list implies a higher rating.
I suppose a standard approach would be to use the Pearson correlation and then invert the indexes in some way.
However, as I understand it, the aim of the Pearson correlation is to compensate for differences between users who typically rate things higher or lower but have similar relative ratings.
It seems to me that if the lists are continuous (although of arbitrary length), it's not an issue that the ratings implied by position will be skewed in this way.
I suppose in this case a Euclidean based similarity would suffice. Is this the case? Would using the Pearson correlation have a negative effect and find correlation that isn't appropriate? What similarity measure might best suit this data?
Additionally while we want position in the list to have effect we don't want to penalise rankings that are too far apart. Two users both featuring an item in a list with very differing ranking should still be considered similar.
Jaccard similarity looks better in your case. To include the rank you mentioned, you can take a bag-of-items approach.
Using your example of (Rob, Barry, Dick) with ratings (3, 2, 1) respectively, you insert Rob 3 times into user a's bag:
Rob, Rob, Rob.
Then you do it twice for Barry. The current bag looks like this:
Rob, Rob, Rob, Barry, Barry.
Finally, you put Dick into the bag:
Rob, Rob, Rob, Barry, Barry, Dick
Suppose another user b has a bag of [Dick, Dick, Barry], you calculate the Jaccard Similarity as below:
The intersection between a and b = [Dick, Barry]
The union of a and b = [Rob, Rob, Rob, Barry, Barry, Dick, Dick]
The Jaccard Similarity = 2/7,
that is, the number of items in the intersection divided by the number of items in the union.
This similarity measure does NOT penalize rankings that are far apart, which matches your requirement: two users both featuring an item in their lists with very different rankings are still considered similar.
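The bag construction and multiset Jaccard above can be sketched with collections.Counter, reproducing the 2/7 result:

```python
from collections import Counter

def bag(ranked):
    """Turn a ranked list into a bag: the top item appears
    len(ranked) times, the next one fewer, and so on."""
    n = len(ranked)
    return Counter({item: n - i for i, item in enumerate(ranked)})

def jaccard(a, b):
    """Multiset Jaccard: per-item min counts over per-item max counts."""
    inter = sum((a & b).values())   # min of counts per item
    union = sum((a | b).values())   # max of counts per item
    return inter / union

u1 = bag(["Rob", "Barry", "Dick"])     # Rob x3, Barry x2, Dick x1
u2 = Counter({"Dick": 2, "Barry": 1})  # user b's bag
print(jaccard(u1, u2))                 # 2/7, as in the example above
```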
The most well-known similarity metric based only on ranking is Spearman's correlation. It just assigns "1" to the first item, "2" to the second and so on, and computes a (Pearson) correlation coefficient on the ranks. (You can make the values descending too, which is more intuitive; it won't matter to Pearson's correlation.)
Spearman's correlation is implemented in the project but, that said, I do not think it is very useful.
Kendall's tau rank correlation is a more principled measure of how much two ranked lists match, but it's not implemented. It would not be hard.
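A minimal sketch of Spearman's correlation for two tie-free rankings, using the closed form that equals Pearson's correlation on the ranks:

```python
def spearman(rank_a, rank_b):
    """Spearman correlation between two rankings of the same items.
    Each argument maps item -> rank (1 = top of the list); assumes no ties,
    so the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies."""
    items = list(rank_a)
    d2 = sum((rank_a[i] - rank_b[i]) ** 2 for i in items)
    n = len(items)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

same = {"Rob": 1, "Barry": 2, "Dick": 3}
reverse = {"Rob": 3, "Barry": 2, "Dick": 1}
print(spearman(same, same))      # 1.0
print(spearman(same, reverse))   # -1.0
```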